and Evaluation
in EDUCATION, 


PSYCHOLOGY, 
and GUIDANCE 




July, 1966 
Copyright © 1964 by Holt, Rinehart and Winston, Inc. 
All Rights Reserved 
Library of Congress Catalog Card Number: 64-21407


20089-0214 


Printed in the United States of America 


Preface 


This volume is a new general textbook designed to serve students and edu- 
cators who wish a more thorough study of the concepts involved in various 
aspects of measurement than is presented in the two earlier books by the 
author in collaboration with Dr. Torgerson. These books, Measurement 
and Evaluation for the Elementary School Teacher and Measurement and 
Evaluation for the Secondary School Teacher, emphasized a functional ap- 
proach to evaluation and included chapters on measurement, diagnosis, and 
corrective instruction in each of the major fields. The present text does not 
replace either of these books. 

Many texts on measurement devote only a few chapters to concepts,
while the remaining sections are concerned with the applications of meas- 
urement in the schools. In this volume, almost every chapter focuses on the 
development of concepts. By attaining a thorough understanding of these
concepts, students will be helped to select or develop measurement proce- 
dures appropriate to their purposes and to interpret measurement data with 
due respect for (1) the inevitable errors involved in any sampling procedure 
and (2) the limitations of indirect approaches typically used as efficient 
substitutes for direct study of behavior. 

Although applications have not been neglected, they are usually pre- 
sented for the purpose of helping students increase their understanding of 
concepts. That is, illustrative examples are given to help students learn to 
apply measurement techniques with discrimination in (1) improving the 
quality of their decision-making about students, and in (2) reaching more 
valid judgments about students’ abilities, interests, and personality traits. 

The following are examples of the ways in which this textbook has aimed 
at a high level of understanding of measurement concepts. 


In Chapter 4, the many aspects of validity in measurement are presented in 
all their complexity. Students are given many examples to help clarify the four 
major types of validity and their significance for the decision-making processes 
in education. 

In Chapters 5 and 6, students are shown that available tests cannot be neatly 
classified into aptitude and achievement tests; illustrations are given of tests that 
represent different degrees of saturation with the verbal-educational factor. 
Instead of presenting separate chapters on different kinds of aptitude tests, all 
types have been considered together in Chapter 6 so that the student can see 
their similarities and differences. 

In Chapters 7 through 9 on interests and personality, a few published inven- 
tories are described, but they are presented as illustrative of different approaches 
to the complex problems of measurement in these areas. 

In Chapters 10 and 11 on teacher-made tests, students are asked not only to 
examine the traditional types of test items but to explore the full range of ob- 
jectives and test items that illustrate a taxonomy of the cognitive domain. Such 
an exploration may help teachers to measure progress toward objectives which 
have heretofore been neglected in their evaluation programs. 

Chapter 12, on the measurement of skills outcomes, has not been limited to 
a survey of published tests. The author has attempted to develop an understand- 
ing of the concepts involved in appraising student performance in the skills of 
communication, homemaking, industrial arts, and other fields. The techniques 
for evaluating products, as well as those for appraising “performance in process,”
are considered. Measurement of skills outcomes is neglected in the typical 
textbook. 

Statistics are introduced throughout the textbook as needed in meaningful 
problem situations. The emphasis is on concept development rather than on the 
development of computational skills. However, the professor who wishes to 
introduce more instruction in statistics into the measurement course will find 
additional materials included in the Appendix. 


Since the author believes that students taking a course in measurement 
should examine and appraise tests in their own subject fields, she has in- 
cluded in the Appendix a comprehensive, classified list of available tests, 
with references to critical reviews in the Buros yearbooks. This list should 
be of considerable value in project assignments. 

The typical measurement class is heterogeneous with respect to the stu- 
dents' readiness for the study of measurement, as well as with respect to 
their needs within this instructional area. The typical class includes under- 
graduate students and experienced teachers and administrators, as well as 
students preparing for work as counselors, psychometrists, or school psy- 
chologists. For this reason, the author has attempted to include materials 
for those interested in individual testing, vocational interest and aptitude 
testing, various approaches to personality assessment, as well as group test- 
ing and educational diagnosis. Moreover, she has attempted to keep the 
textbook readable for the average student, while meeting the needs of ad- 
vanced students through special tables, footnotes, and selected references. 

The author wishes to express her appreciation to Dr. T. L. Torgerson,
who served as consultant throughout the writing and revision of the manu- 
script; to two coworkers at California State College at Los Angeles, Dr. 
Edwin Wandt and Dr. Carleton Shay, for their critical review of several 
chapters; to Dr. Miriam Bryan of Educational Testing Service for her 
review of Chapters 11 and 13. The author would also like to acknowledge
her indebtedness to Mrs. Johanna de Graff for her highly competent assistance
and her conscientious attention to detail in the typing of the manuscript.
Finally, the patience and cooperation of her husband and children
are gratefully acknowledged.
Georgia Sachs Adams 


Professor of Education 


Los Angeles 
May 1964 


Contents


PREFACE


PART ONE

Basic Principles and Procedures


1. Introduction

Comparison of the Terms “Measurement” and “Evaluation” • The
Basic Requirements of the Evaluative Process • The Teacher's
Role in Evaluation • Illustrative Problems in Measurement and
Evaluation • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


2. Interpreting Test Data in Terms of Converted Scores

Converted Scores Based on Comparison with a Perfect Score •
Converted Scores Based on Comparisons Among Examinees •
Defining Norming Populations and Selecting Norm Samples • Use
of Different Types of Converted Scores in Computation •
Summary Statement • Selected References • Discussion Questions
and Suggested Activities


3. Reliability

Interpreting Test Scores in Terms of Sources of Variance •
Computing Correlation Coefficients • Comparison of Standard
Errors and Reliability Coefficients as Measures of Reliability •
Methods of Estimating Reliability of Test Scores • Reliability of
Difference Scores • Factors Affecting the Size of Reliability
Coefficients • Improving the Reliability of Test Scores •
Summary Statement • Selected References • Discussion Questions
and Suggested Activities


4. Validity

Tests as Direct or Indirect Measures of Criterion Behavior •
Types of Judgments Made on the Basis of Test Results • Content
Validity • Concurrent Validity • Predictive Validity • Construct
Validity • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


5. Application of the Principles of Measurement
in the Selection of Tests

Types of Tests Available • Evaluating Tests for Use for Specific
Purposes • Illustrative Use of the Summary Form with a
Standardized Test • Sources of Information about Published
Tests • Summary Statement • Selected References • Discussion
Questions and Suggested Activities


PART TWO

The Study of Individuals


6. The Measurement of Aptitudes

The Concepts of Aptitude and Achievement • Tests of General
Mental Ability or Scholastic Aptitude • Multiscore Tests of
Mental Abilities and Aptitude Test Batteries • Tests of Special
Aptitudes • Prognostic Tests • Purposes for Which Aptitude
Tests Are Used • Interpretation of Results from Scholastic
Aptitude Tests • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


7. The Measurement of Interests and Attitudes

The Nature of Interests • Types of Interest Inventories • Basic
Interest Groups • Validity of Interest Inventories • Interpretation
of Interest-Inventory Results • Measurement of Attitudes •
Summary Statement • Selected References • Discussion Questions
and Suggested Activities


8. Informal Methods of Studying Personal-Social
Adjustment

The Nature of Personal-Social Adjustment • Personality
Description • Sources of Data About the Personal-Social
Adjustment of Individuals • Self-Report Techniques • Observation
of Behavior • Obtaining the Opinions of Others: Teacher
Rating Scales • Obtaining the Opinions of Others: Sociometric
Techniques • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


9. Personality Inventories and Projective Techniques

The Psychometric Approach: Personality Inventories • Projective
Techniques • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


PART THREE

The Improvement of Instruction


10. Development, Try-out, and Revision of
Teacher-made Tests

Importance of Teacher-made Tests • Characteristics of a Good
Teacher-made Test • Planning Tests for Greater Content
Validity • The Advantages and Disadvantages of Essay and
Objective Tests • The Construction of Test Items • Evaluating
Objective Teacher-made Tests • Preparing a Teacher-made Test
for Use • Statistical Analysis of Test Results • Teacher
Cooperation in Test Development • Providing Leadership in the
Development of Teacher-made Tests and Other Aids to
Evaluation • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


11. The Taxonomy of Educational Objectives and Test Items
Illustrative of Its Major Categories

1.00 Knowledge • 2.00 Comprehension • 3.00 Application •
4.00 Analysis • 5.00 Synthesis • 6.00 Evaluation • Summary
Statement • Selected References • Discussion Questions and
Suggested Activities


12. Evaluating Student Performance in the Skills

Developing Tests of Skills Outcomes • Scoring Processes and
Products • Illustrative Evaluation Techniques in the
Communication, Manipulative, and Athletic Skills • Validity and
Reliability of Evaluations of Student Performance in the Skills •
Summary Statement • Selected References • Discussion Questions
and Suggested Activities


13. The Place of Standardized Achievement Tests
in the Improvement of Instruction

History of Achievement Testing • Uses of Standardized
Achievement Tests • Leading Achievement Tests • Interpretation
of Data from Achievement Testing Programs • Large-Scale
Testing Programs • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


14. Educational Diagnosis

Measurement as Basic to Educational Diagnosis and
Individualized Instruction • Levels of Diagnosis • Steps in
Educational Diagnosis • Group Diagnosis • Basic Principles of
Corrective Instruction • Summary Statement • Selected
References • Discussion Questions and Suggested Activities


PART FOUR

Administrative, Supervisory, and Guidance Aspects
of Measurement and Evaluation


15. Planning and Administering the Evaluation Program

Functions of the Evaluation Program • Characteristics of an
Effective Evaluation Program • Planning the Evaluation
Program • Guidance Workers and Psychologists as Resource
Persons • Planning the Testing Program • Administering the
Testing Program • Summary Statement • Selected References •
Discussion Questions and Suggested Activities


16. Summarizing, Recording, and Reporting Data
about Individual Students

Summarizing and Recording Data • Reporting Data to Students
and Parents • Improving the Validity, Reliability and
Comparability of Teachers’ Marks • Summary Statement • Selected
References • Discussion Questions and Suggested Activities


17. Using Measurement Data in Individual
and Group Guidance

Guidance Responsibilities of Counselors and Teachers • Issues
and Principles Involved in the Use of Measurement Data in
Guidance • Guidance in Educational and Vocational Planning •
Combining Group and Individual Approaches in Helping High
School Students in Self-Appraisal and Life Planning • Summary
Statement • Selected References • Discussion Questions and
Suggested Activities


APPENDIXES


INDEX


PART ONE


The Evaluative Process


1 Introduction 


For many students, this textbook will constitute their first introduction to 
measurement and evaluation. For this reason, we have tried to develop the 
basic concepts in measurement through the use of nonmathematical expla- 
nations and realistic examples. Other students using this textbook will 
have already had a unit in measurement as part of an introductory course 
in education or psychology. These students will find that this textbook 
reviews concepts they have studied but leads them on to a higher level of 
understanding. In fact, through the study of summary tables, footnotes, and 
chapter references, they will be able to pursue their interest in any topic 
beyond the limits of the textbook. They will find that the concepts of 
measurement and evaluation have implications for almost every aspect of 
teaching, guidance, and administrative work. 


EVERYDAY USES OF MEASUREMENT AND EVALUATION 
During the course of a school day, teachers, principals, and other school 
personnel make many decisions about students and help them to make 
many decisions for themselves. 

Decisions are best made on the basis of a good deal of information, and 
schools have cumulated considerable information about each student. The 
data that are helpful in grouping students for physical education, however, 
are not identical with the data that would be most helpful in deciding which 
students should be placed in accelerated classes in mathematics. We want
data that are most relevant to each decision. 

Decisions usually involve prediction. The following questions are typical: 
On which athletic team or in which mathematics class is this student likely to 
make the greatest growth? Is this student likely to be admitted to the
college he plans to attend? The first of these two questions involves
institutional decisions about students; the second involves providing data to the
individual that will help him in making his own decision. In each case, we 
need relevant data, that is, data that will increase the accuracy of the judg- 
ments and inferences we make. 

In order to improve the decision-making process, we need measurement 
data about individuals. For decisions regarding grouping in physical edu- 
cation, for example, we need to measure height, weight, and the student's 
achievement in certain physical skills. When we measure height or weight, 
our only sources of error would be inaccuracy in the instrument of meas- 
urement and careless errors in reading the results. When we measure 
achievement in the skills, however, we face new problems. The person's 
speed of running varies somewhat from time to time; when we time him 
on one occasion, we obtain only a sampling of his running ability. We 
must recognize that there are fluctuations in individual performance. When 
we make an inference from one sample of a boy's running ability, sampling
error is involved. The only way in which we could determine the amount
of variation in running time from one sampling to another would be to 
check the variations in performance. The variation might be less with older 
students than younger ones; it might be less with trained runners than with 
typical students. 
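
The idea of checking variation from one sampling to another can be sketched in a few lines of Python. This is only an illustration: the five running times below are hypothetical, not data from any study, and the function name `timing_summary` is our own. The sketch summarizes repeated timings of one runner by their mean, standard deviation, and range, so that a single timing can be read against the spread of several.

```python
from statistics import mean, stdev

def timing_summary(times_sec):
    """Return (mean, standard deviation, range) for repeated timings of one runner."""
    return mean(times_sec), stdev(times_sec), max(times_sec) - min(times_sec)

# Hypothetical dash times (in seconds) for one student on five occasions.
trials = [7.9, 8.3, 8.0, 8.4, 7.9]

avg, spread, span = timing_summary(trials)
print(f"mean = {avg:.2f} s, standard deviation = {spread:.2f} s, range = {span:.2f} s")
```

A smaller standard deviation for a trained runner than for a typical student would be one way of expressing the expectation stated above.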

When we measure such a skill as ability to throw baskets, we realize 
that we must standardize the testing conditions regarding distance from 
and height of the basket. When we measure batting skill, we are likely to 
find still greater variation in student performance from sample to sample 
because of the introduction of a new source of variation—the performance 
of the pitcher. Moreover, there is more subjectivity involved in scoring 
batting performance than there was in scoring "baskets." In making such 
subjective judgments, coaches agree with each other much more than would 
untrained observers.

If we try to judge how well a student knows his spelling, arithmetic, or 
geography, we will get variations in his scores from one test to another 
because each test includes a different sampling of questions. From these 
examples, we see that (1) it is desirable to compare the performance of 
individuals under standard conditions and (2) we need to know the 
sources of error and the amount of error involved when we make judg- 
ments about an individual on the basis of a sampling of his behavior. 
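
A small simulation may make the effect of question sampling more concrete. The sketch below rests on simplifying assumptions of our own: a student "knows" a fixed 70 percent of a pool of 500 possible questions, and each test is a random draw of 25 questions from that pool. All of these figures, and the function name, are hypothetical. Even though the student's knowledge does not change, the percentage correct differs from test to test.

```python
import random

def simulate_test_scores(known_fraction=0.7, pool_size=500,
                         test_length=25, n_tests=6, seed=1):
    """Percentage-correct scores of one student on several randomly drawn tests."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    n_known = round(known_fraction * pool_size)
    # True marks a question the student can answer; False marks one he cannot.
    pool = [True] * n_known + [False] * (pool_size - n_known)
    scores = []
    for _ in range(n_tests):
        items = rng.sample(pool, test_length)  # one test = one sample of questions
        scores.append(100 * sum(items) / test_length)
    return scores

scores = simulate_test_scores()
print(scores)
```

Lengthening the test (a larger `test_length`) would generally narrow the spread of such scores, which is one reason the amount of error depends on the size of the sample of behavior.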

Obtaining summary scores that can be recorded, combined, and interpreted
is simple for some dimensions of the individual and extremely difficult
for others. We have no problem with height and weight; number of
baskets thrown can be objectively scored; umpires are given training in
judging objectively “hits,” “fouls,” “strikes,” and “balls.” It is no problem
to obtain “number right” or “percentage correct” on a test of spelling,
history, or geography.

For decisions on selection of students for an accelerated class or the 
classification of students into various groups, number right or rank within 
the total group may suffice. For other uses of measurement, we would like 
to know whether students have achieved as well, or made as much progress 
in a year, as other students of their age and grade. To answer such ques- 
tions, we need data concerning representative samples of students; we need 
to obtain age or grade norms or other types of data that aid in making 
meaningful comparisons of individuals and groups, and meaningful intra- 
individual comparisons, such as inferring that a student performs more 
adequately in mathematics than in social studies. Actually, when a teacher 
makes statements about a student behaving immaturely or acting like a 
younger child, he is interpreting a sample of the child's present behavior 
in terms of his own “norms,” that is, his cumulated observational data 
about students in various age groups. The sampling of students he has 
observed, however, may or may not have been representative. 
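
One way of comparing a student with a norm group can be sketched as follows. The sketch uses the percentile rank, computed under one common convention (the percentage of the group scoring below the student, plus half the percentage with the same score). Both the choice of this particular converted score and the ten norm-group scores are assumptions made for illustration only.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group below `score`, counting half of any ties."""
    below = sum(s < score for s in norm_scores)
    equal = sum(s == score for s in norm_scores)
    return 100 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical raw scores of a (very small) norm sample on a spelling test.
norm_group = [12, 15, 15, 18, 20, 22, 25, 27, 30, 33]

print(percentile_rank(22, norm_group))  # 55.0
```

The figure is meaningful only to the extent that the norm sample is representative of the group with which we wish to compare the student.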


COMPARISON OF THE TERMS "MEASUREMENT" 
AND "EVALUATION" 


When a representative of the American Psychological Association was 
asked to define "psychological tests" at a Congressional hearing, he gave 
a definition that, although broad, includes all essential characteristics. His 
definition was: “Psychological tests are nothing more than careful observations
of actual performance under standard conditions.”¹ The term “careful” implies
that the procedures for sampling the performance and obtaining a record of it
are systematic and objective enough that different
observers would obtain reasonably comparable findings. 

This definition could also be used for the concept of measurement. 
When we obtain measures (other than such physical measures as height 
and weight), we obtain and record data on a sampling of performance 
under standard conditions. “Evaluation” goes beyond measurement in that 
value judgments are involved. We measure a student's abilities in differ- 
ent areas. When we interpret these scores in terms of standards for his 
grade, in terms of his educational or vocational plans, or some other basis 
for making value judgments, we are no longer restricting ourselves to
“measurement”; we are now “evaluating” his abilities or his progress. We
measure a student's height; we evaluate it in terms of his goals (as jockey
or high jumper). We measure a student's speed of reading; we evaluate it
as unsatisfactory or satisfactory in terms of his age, previous experience,
and educational goals.


1 “Report of Testimony at a Congressional Hearing,” American Psychologist, vol.
13 (May 1958), 217-223.

Sometimes measurement and evaluation seem to be inseparable. For 
example, it is difficult to compare two poems, two short stories, two sam- 
ples of handwriting, or two swimming strokes without rating them in terms 
of value judgments. We could measure poems with respect to number of 
words or length of line; our comparisons would be highly objective and 
consistent, but they would not be relevant to the goals of education. Hence, 
we choose to make a fairly subjective evaluation of the poems, rather than 
to measure them with respect to irrelevant dimensions. 


THE BASIC REQUIREMENTS OF THE EVALUATIVE PROCESS 


Since educators and psychologists are concerned chiefly with those meas- 
urements that can provide a basis for evaluation, it is well to get an 
overview of the steps in any evaluative process. Although most of our
examples are taken from tests, in the narrower sense of the term, the same
processes apply to observation of behavior as a basis for ratings on selected
characteristics.


1. Determining what we wish to evaluate. The information we obtain should
be relevant to the type of judgment we wish to make (for example, deter- 
mining the level of children's reading vocabulary as a basis for selecting 
textbooks, or their speed of reading as a basis for judging length of reading 
assignments). 

2. Defining what we wish to evaluate in terms of behavior. We need to define 
level of reading voc 
choose the correct s 


test items, should be sufficiently 
based on the student's performance on the sample would give a fairly 
accurate indication of his usual level of performance, for example, his 
vocabulary knowledge on all words at specified grade levels, or of his char- 
acteristic reading speed. In a speed-of-reading test, we would like to sample 
the different kinds of textbook-type material that children are expected to 
read at that grade level, rather than having the test composed entirely of 
science materials, which some students would read at a rate above, and
others at a rate below, their characteristic reading rate. 


4. Getting a record. In paper-and-pencil tests the student provides his own 
record, which can be scored later. In the testing of physical skills, one might 
decide to record the performance on film; in the testing of pronunciation in 
foreign language, a tape recording might be utilized. In the evaluation of 
student characteristics (defined in behavioral terms), the teacher's observa- 
tions could be recorded in narrative form, with emphasis on the description 
of behavior and the avoidance of judgmental terms. 

5. Summarizing the evidence. As already explained, some evidence is easily 
summarized, for example, running time (in seconds), number of baskets 
made, and the like. In a vocabulary test, the score could be number of words 
for which the correct definition is selected; or we might decide to penalize 
the student for "incorrect guesses." Greater care is required in deciding how 
to summarize the data from a film on physical performance, a tape record- 
ing of a student's oral language performance, or a narrative report on 
children's behavior. We need to decide on the aspects to be "scored," the 
units to be used in scoring, and the like. 
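
The choice between "number right" and a score that penalizes incorrect guesses can be illustrated with a short sketch. The formula used here is the conventional correction for guessing, right minus wrong divided by one less than the number of options per item. It is supplied as an assumption for illustration and is not taken from the passage above; the function name and the sample figures are likewise hypothetical.

```python
def corrected_score(num_right, num_wrong, options_per_item):
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items are neither rewarded nor penalized.
    """
    if options_per_item < 2:
        raise ValueError("each item needs at least two options")
    return num_right - num_wrong / (options_per_item - 1)

# Hypothetical 40-item vocabulary test with five options per item:
# 30 right, 8 wrong, 2 omitted.
print(corrected_score(30, 8, 5))  # 28.0
```

Under this convention a student who guesses blindly on every item would be expected to earn a corrected score near zero, whereas his "number right" score would be well above zero.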


The examples of measurement we have given in this chapter have been 
fairly simple ones. How much more complex it is to attempt to measure a 
child's mental ability or aptitude for school work, his readiness for reading, 
his comprehension of the principle of photosynthesis, his ability to inter- 
pret maps and graphs, his ability to communicate effectively through writ- 
ten or oral expression, his interest in different vocations, or his attitude 
toward school. 

Dyer feels that we must face up to the inherent complexity of educa- 
tional measurement and work on its problems with humility, persistence, 
and all the competence we can muster. 


I don't think the business of educational measurement is inherently simple 
. . . Any way you look at it, the measurement of human behavior is bound to
be a terribly complex process, since the phenomena of human behavior are 
themselves as complex as anything in the universe. . . . 

. . . teachers [must] more clearly realize the fact that measurement in one 
form or another is not only an indispensable part of their job but that, like all 
the other parts, it is full of difficulties and unanswered questions which require 
constant study and hard thinking and a willingness to move ahead on the basis 
of highly tentative hypotheses. . . . 

On the one hand, we want them to become keenly aware of all the uncertain- 
ties in even the most careful measurement of pupil performance, and on the 
other hand, we want them to regard measurement as part and parcel of effective 
instruction. . . . 

I think we can do it by getting them to regard all their classroom work as a 
continuous series of both minute and longer range experiments in pupil
learning. . . . Always central is the nagging question: "Did it work?" And this, of
course, is where measurement gets into the act. . . . Of course, she can never 
know for certain whether her strategy is on the right track, since the
instruments and techniques on which she has to depend for checking her hunches
and hypotheses about procedures are never wholly reliable or relevant. In most 
cases, I believe, the realization of uncertainty is achieved only after the cold 
steel of such ideas as sampling error, the variability of human behavior, and 
the fallibility of casual observation and personal judgment has entered the 
teacher's soul. She will learn to be sure that she cannot be sure, and accordingly 
her approach to the instructional task will become less rigid, more tentative
and . . . more responsive to the individual learning needs of the pupils with
whom she is confronted.²


THE TEACHER'S ROLE IN EVALUATION 


Under the survey-testing approach to evaluation, which dominated the 
1920s, the teacher had little say in the selection of the tests used or the 
interpretation of test results. Subject-matter content and standards of 
achievement were centrally established, and testing was considered an ad- 
ministrative-supervisory function. 

The child-study approach, individualization of instruction, and the 
greater breadth and more functional nature of the modern curriculum have 
all modified the concept of evaluation. That is, the instructional and child- 
study uses of measurement and evaluation have tended to predominate in 
importance over the administrative-supervisory uses. Hence, the teacher 
now has a key role in the evaluation process. 

As the teacher has achieved a more significant role in planning the edu- 
cational experiences for his class, he has become responsible for appraising 
the worthwhileness of those experiences—the extent to which students are 
achieving educational goals. Furthermore, as the schools have committed 
themselves to the ideal of individualizing instruction and meeting student 
needs, so that all may learn more effectively, emphasis has been placed on 
accumulating and interpreting measurement data for individuals rather 
than for groups. Here again, because of his daily opportunities to obtain 
additional information about his students and to put the measurement data 
to use, the teacher is the key person. 

The teacher's job as an evaluator has two essential facets, neither of 
which can occupy his exclusive attention. One has its orientation in the 
group instructional program and the extent to which the class, as a whole, 
is achieving its goals; the other has its orientation in the study of individual
students, in diagnosis of their growth lags and discovery of the important
causal factors for such lags. Both approaches are significant aspects of
evaluation and indispensable to good teaching.


2 Henry S. Dyer, "What Point of View Should Teachers Have Concerning the
Role of Measurement in Education," 15th Yearbook, National Council on
Measurements Used in Education (New York: The Council, 1958), pp. 11-12.

Although all of this textbook has been written with the objective of 
helping teachers to become more effective in their work in measurement 
and evaluation, Part Three is especially concerned with the contributions 
that can be made to the improvement of instruction. 


ILLUSTRATIVE PROBLEMS IN MEASUREMENT 
AND EVALUATION 


Measurement is an essential but difficult aspect of the educational program. 
If teachers are to persevere in becoming more informed and effective in 
this aspect of their teaching role, they need leadership, encouragement, and 
resource materials from their school administrator. Moreover, the school 
administrator has responsibilities to students, parents, and the community, 
which he cannot discharge effectively without the use of measurement 
techniques. 

Instead of listing administrative responsibilities in evaluation, let us ex- 
amine some of the problems faced by a hypothetical junior high school 
principal and his staff as they begin a new school year. These problems 
will be used as a basis for later discussion of the development of local 
norms, the basic concepts of reliability and validity, and many other 
concepts. 

Let us assume that Mr. Smith has met with his staff to list problems they 
will face during the school year. Mr. Smith sensed that he and his staff 
members were making many important decisions without having adequate 
information. He felt sure that test results could be used in many cases to 
increase the informational basis for their decision-making. Many published 
tests were administered in his school; but the results, he suspected, were 
often filed away and forgotten. Some teachers felt that tests were of little 
value; others had an unquestioning reverence for test scores, which did 
not lead to desirable caution in their interpretation. 


1. The first problem Mr. Smith raised was very important, yet very complex. 
He emphasized that, as principal, it was his obligation to find out, as best he 
could, whether students were making progress toward the major objectives of 
the educational program. He recognized the rights of individual teachers to 
decide on the specific learning activities and content through which important 
principles and concepts should be taught. He felt that it was clearly the teachers’ 
responsibility to measure in their own way how accurately students had learned
these specifics. How could he best discharge his basic responsibility without 
interfering with the freedom of teachers as professional workers to plan the 
day-by-day learning experiences of their students? Staff discussion showed that 
answering such a question involved (a) careful selection of tests that would 
measure growth toward the major goals of education, rather than memory of 
details and (b) wisdom in the way in which test results were interpreted and 
used. 

One staff member cautioned the principal about interpreting the comparative 
results for different teachers, not only because of differences between classes in 
average scholastic aptitude and background skills but also because of the diffi- 
culties inherent in measuring progress toward the less tangible objectives. Mr. 
Smith reassured his staff members that no one was asking him to rank his 
teachers in the order of their competence. His goal would be to develop 
hypotheses concerning classes that appeared to be doing a less-than-adequate 
job and then to test out these hypotheses through observation of teaching and 
a study of instructional and evaluation materials used. The assistance of supervisors
and department heads would be needed, not only in checking upon his
hypotheses concerning possible areas of weakness but in taking steps to help 
improve student achievement in any such areas. 

2. Another problem was concerned with the wide variation in grading prac- 
tices. Teachers make daily decisions concerning grades on homework, quizzes, 
products made in class, student participation in committee activities, and other 
aspects of student work. These data are summarized in grades that go home to 
parents, are recorded on students’ cumulative records, and are used in making 
many decisions about students. When grades serve as the basis for decisions 
regarding eligibility and scholarships, when they are used by the student and his 
counselor as a partial basis for judging strengths and needs, and when they are 
used by schools and colleges as an aid in admissions and grouping, the assump- 
tion is made that grades are reasonably comparable. 

The staff discussed what they could do to help teachers improve the com- 
parability of their grades, at least within subject fields. The counselor gave the 
example of John and Peter, identical twins, both college-bound, whose grades 
in the past had approximated a B+ average. Certainly it seemed indefensible 
when a "trick of fate" resulted in John's being assigned to a history class in 
which he received an "easy A" because the teacher's standards of grading were 
generous and the competition from classmates was minimal, while Peter re- 
ceived a C in a class where competition was keen and the teacher was firmly 
committed to a policy of no A's and few B's. 

Mr. Smith hesitated to interfere; yet, as principal, he knew that he and his 
staff should help teachers to do more fairly and effectively this job of evaluation 
and grading that was so important to students. Considerable opposition was
expressed to requiring teachers to "grade on the curve" in every class. Agreement
was reached, however, on two principles:




a. If students’ grades were to be assigned fairly, they must be assigned on 
the basis of students' ranks with respect to some composite score that re- 
flected their many achievements. 

b. These composite scores should not be arrived at in an intuitive manner but 
through such procedures that the teacher could explain his basis for 
grading to students and parents. 


3. A third problem was concerned with the advisability of using teaching 
machines to aid students in their learning of content and skills. One of his 
seventh-grade teachers was eager to try out their use in the teaching of spelling; 
the school district had agreed to supply the machines; the teacher, with the 
help of his coworkers, planned to compare results in the classes using teaching 
machines with those in matched groups, comparable in scholastic aptitude and 
other important factors, which were taught spelling in the usual manner. Fortu- 
nately, since the teacher was doing the research project for his master's degree, 
the planning of the research could be left largely to him and to his supervising 
professor at the college. However, Mr. Smith and his staff would be concerned 
with the findings of this study and the implications for their school, and hence 
they did have the responsibility of seeing that adequate tests of proficiency in 
spelling were selected to evaluate student progress. Certainly a test developed 
for nation-wide use would not be as adequate a basis for judgment as tests
based on the state spellers. 

4. Mr. Smith reported that he had been asked to head a city-wide evaluation 
committee. This group was to provide information to the superintendent and 
governing board that would help them to decide whether eighth-grade students 
were achieving an adequate knowledge of the history of their nation and their 
state. With respect to United States history, Mr. Smith realized that the committee's
problem would be one of selecting a test, with national norms, that
would be considered fair by students and teachers and that would have a suffi- 
cient range of difficulty to measure student knowledge in all the schools of a 
heterogeneous community. 

The decision concerning the adequacy of student knowledge of state history 
posed a different problem. No published tests were available. It would probably 
be best for his committee to ask representative teachers from the different 
schools to plan a blueprint for the test, which would indicate how much 
emphasis should be given to different objectives and areas of content. The 
course of study and the state textbook could be analyzed as an aid to teacher 
judgments. Once decisions regarding objectives and content were made, how- 
ever, one or more teachers with special proficiency in item writing could devise 
items that would be reviewed by the representative group. 

Mr. Smith wondered if, as an additional by-product, he could obtain infor- 
mation that would help on one aspect of Problem 1, which would help him in 
appraising how well his staff was teaching American and state history. One
teacher reminded him that a test which might be quite adequate in sampling 
student information about history might fail to measure other important out- 
comes. In other words, a test that might serve to assure the public that a 
minimum competency in knowledge was being achieved by students would need 
to be supplemented by other measures before an over-all appraisal could be 
made concerning the adequacy of the instructional program. 

5. Another problem concerned the selection of students for participation in 
a second-semester pilot program for gifted seventh-graders. These students 
would have their English, social studies, and mathematics instruction with a 
teaching team of three teachers especially interested in teaching the gifted and 
presumably most qualified to do so. Approximately one hundred students were 
to be selected from a seventh-grade population of five hundred. A committee 
had already decided that the selection should be made on the basis of the 
students' scholastic aptitude, reading ability, previous achievement in elementary 
school subjects, and teacher ratings on such significant traits as "seriousness 
of purpose." 

Fortunately relevant data were already available in the school records. More- 
over, the Committee on Gifted Students had already decided what factors to 
consider. An intelligence quotient of at least 120 and reading ability at the 
eighth-grade level or above were to be required. In choosing among the students 
who qualified on these two bases, considerable weight was to be assigned to 
sixth-grade marks. Greatest weight was to be given to ratings on "seriousness 
of purpose,” “responsibility,” and “originality,” because the future of the entire 
program depended on how well students achieved in this pilot project. 

Obtaining comparable data on such traits as the “seriousness of purpose” of 
students coming from 18 sixth-grade classes in 4 elementary schools seemed 
impossible. Staff members warned that the subject grades of teachers from the 
different elementary schools would not be comparable. Moreover, the Commit- 
tee on Gifted Students had evidently assigned the greatest weight to the factors 
that would be the most difficult to measure, that is, teacher ratings on "origi- 
nality" and other traits. 

6. Decisions of still another type needed to be made in the counseling of 
eighth-grade students concerning their choice of college preparatory subjects 
for the ninth grade. This problem was similar to problem 5 in that predictions 
of future achievement were involved. 

In problem 5 there was a maximum number that could be accommodated, 
hence this was a selection problem similar to that of a company selecting 
persons for a limited number of jobs. In problem 6, the principal is under no 
serious restrictions with respect to numbers; for example, if another class in 
algebra is needed, it can easily be substituted for one in general mathematics.
The prediction problem here is one of classification or placement of students
in terms of predicted “best treatment,” as when a personnel manager recommends
the most suitable jobs for each of a group of persons who have already
been employed.

Problems 5 and 6 differ in another respect. Problem 5 involves institutional 
decisions about students; problem 6 involves decisions by students. The job of 
the school is to help students and their parents make thoughtful choices, on 
the basis of interpretation of information most relevant to this decision. One 
of three choices could be made. For some students, the data on test scores and 
grades would show high academic aptitude and evidence of adequate motiva- 
tion; for these students, taking both college-preparatory mathematics and 
language in ninth grade would be clearly indicated. The test scores and grades 
of other students might support the decision that it was best to attempt neither 
subject at the ninth-grade level, but to decide later (after a year of general 
mathematics) whether algebra was advisable. Still other students might find it 
advisable, on the basis of marginal or inconsistent data, to attempt one but not 
both college-preparatory subjects in the ninth grade. 

7. The school counselors and the ninth-grade social studies teachers would 
be faced with the job of helping pupils make tentative choices regarding their 
vocational plans. Although specific decisions could be postponed until senior 
high school, the combination of the ninth-grade vocations unit and individual 
counseling should help students to appraise their abilities, achievements, and 
interests in such a way as to make wiser vocational choices. One of the coun- 
selors reported that on Career Day 90 percent of the ninth-grade students had 
signed up for a dozen popular vocations. He contended that the use and inter- 
pretation of an interest inventory would help many of them to broaden their 
perspectives. He also suggested that an aptitude test battery be administered, 
which would help him as counselor in conferences with students and parents 
on educational and vocational planning. 

Problem 6 had required only a prediction of the general level of the 
student’s success in college-preparatory subjects. In vocational guidance, how- 
ever, the predictions are of a higher order of complexity. For example, would 
John, an able student who is trying to decide between two vocations that
interest him, be more likely to succeed in engineering or medicine? Answering
questions of this type requires comparative or differential prediction. Tests
that help to predict a student’s general level of achievement in academic work 
might not be adequate for predicting the difference between his predicted 
achievements in two vocational fields or two college curricula. 

The student is concerned not only about his relative success in different 
vocational fields but also about how satisfying different vocations would be in 
the light of his interests and temperament. Making inferences concerning 
probable job satisfactions is unusually precarious. In predicting achievement, 
one is usually safe in assuming that the more ability a student has shown to 
date (in music, sports, or some other field), the more success he will probably
have in the future. That is, the higher the score the better the prospects. But 
in predicting job satisfaction, even this assumption seems questionable. For 
example, there might be an optimum range of intelligence for file clerks, with
the brighter girls finding the field too routine and unchallenging; similarly, it
might be best for a prospective librarian to be below average in certain characteristics,
such as energy level and extroversion. Certainly the problem of
predicting “job satisfaction” is of a different nature than that of predicting 
success, not only with respect to the types of test data that might prove most 
usable but also the techniques used in making predictions and the degree to 
which one could feel confidence in making predictions. 


SUMMARY STATEMENT 


Although we have listed illustrative problems facing a junior high school princi- 
pal, the decisions involved are shared by teachers, counselors, and others en- 
gaged in the complex job of educating and guiding youth. Moreover, these 
problems are not confined to one age or maturity level. If a similar list of prob- 
lems had been prepared by an elementary school principal, problems 1, 2, 3, 
and 5 might have been much the same; problem 4 would have been similar,
although the areas in which national norms would have been stressed would be 
the “fundamental skills,” while the committee of local teachers might have been 
needed to develop a test on a social studies area, such as Latin America. And 
since elementary school teachers are permitted wide latitude in choosing the 
content through which unit objectives are realized, a committee might decide 
against the construction of a city-wide test, preferring to develop a pool of items 
from which individual teachers could draw in developing their own classroom 
tests. Although problems 6 and 7 (as stated) are foreign to the elementary 
schools, similar prediction problems do occur at that level, for example, group- 
ing children into reading and other instructional groups within the classroom 
on the basis of reading readiness scores and teacher judgment, as well as helping 
parents to decide whether their students are likely to benefit from special reme- 
dial services or from being accelerated or held back. 


Carefully selected tests can provide data that will aid in making many professional
judgments and decisions. However, a test that is best for checking on
how well students have mastered the minimum essentials of a course may not
be best for predicting how well they will achieve in a similar course of a higher
level. We need to study each type of inference or decision for which we
will be using test results.


Obviously, in some of the problems listed above, sound decisions require the 
use of published tests for which national norms are available, that is, data re- 
garding the test performance of representative groups of students in the nation's 
schools. When we are asked to inform the superintendent and community about 
students’ knowledge of American history, we must admit that we have no 
absolute yardstick. But we can, if we use a test with national norms, interpret
students’ scores in terms of how they compare with the success of students of
similar ability throughout the country.

On the other hand, for problem 3 (on the use of teaching machines in spelling
instruction), national norms are not needed; and a local test would be more
adequate than a standardized one. We could take every twentieth word from
the 500 words in the state speller to make a representative 25-word test. If the 
average score on this test at the end of the year is 20 words, or 80 percent, we 
can infer with a fair degree of confidence that students can spell 80 percent of 
the entire list of 500 words. Types of decisions for which we need norms, and 
types of local and national norms, are discussed in Chapter 2.
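
The sampling procedure just described can be sketched in a few lines of Python. This is an illustration only: the word list is a hypothetical stand-in for the 500-word state speller, and the function names are our own rather than part of any published procedure.

```python
def systematic_sample(words, step=20):
    """Return every `step`-th word (the 20th, 40th, ...) from the list."""
    return words[step - 1::step]


def estimated_words_known(mean_correct, test_length, universe_size):
    """Project the mean number correct on the sample test onto the full list."""
    return mean_correct * universe_size / test_length


# Hypothetical stand-in for the 500 words of the state speller.
speller = [f"word{i:03d}" for i in range(1, 501)]

test_words = systematic_sample(speller)  # 25 words: word020, word040, ..., word500

# A mean of 20 words correct out of 25 suggests about 400 of the 500 words.
estimate = estimated_words_known(20, len(test_words), len(speller))
```

Because the words are taken at a fixed interval through the whole list, every part of the speller is represented in the 25-word test.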

The expression, "with a fair degree of confidence," used in the preceding 
paragraph, refers to the concept of errors in measurement. Certainly we saved 
a good deal of time when we used a short 25-word test rather than one with one 
hundred words or more. But wouldn't one sample of only 25 words tend to 
favor some students who, by chance, had drilled on certain words that happened 
to be on that list? Since problem 3 involved a comparison of averages,³ we are
on fairly safe ground in making the inference concerning the level of achieve- 
ment on the total list. If we were going to use such a test for making inferences 
about the achievement of individuals, however, and for assigning individual 
grades in spelling, the test would need to be longer so as to minimize the effect 
of the chance factors involved in any sampling process. The sources of errors in 
measurement and methods of estimating the size of measurement errors will 
be considered in Chapter 3.
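
The contrast between inferences about a group average and inferences about an individual can be illustrated with a rough calculation. The sketch below rests on several assumptions of our own: that the 25 words are a random sample drawn without replacement from the 500-word list, that a student actually knows 80 percent of the list, and that the chance errors of different students are independent. The function names and the group size of 250 are chosen for illustration only.

```python
import math


def se_individual(p, n_items, universe_size):
    """Standard error of one student's proportion correct on a random
    n-item sample drawn without replacement from the universe."""
    # Finite-population correction for sampling without replacement.
    fpc = (universe_size - n_items) / (universe_size - 1)
    return math.sqrt(p * (1 - p) / n_items * fpc)


def se_group_mean(se_one, n_students):
    """Standard error of a group mean, assuming independent errors."""
    return se_one / math.sqrt(n_students)


se_one = se_individual(0.80, 25, 500)   # roughly 0.08 (8 percentage points)
se_mean = se_group_mean(se_one, 250)    # roughly 0.005 (half a percentage point)
```

Under these assumptions a single student's score could easily be off by 8 percentage points, whereas two standard errors on either side of a group average of 80 percent span only about 79 to 81 percent.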

On some problems, such as the spelling study, we can easily devise a test 
that constitutes a representative sampling of the content in which we are in- 
terested. One or more random samplings of the state speller list of five hundred 
words are easily obtained. In developing the test of state history, we again wish 
to sample a universe⁴ of possible items, but it is impossible to define that universe
as precisely as we could do with the spelling words. The question of how
to achieve representativeness of content in such situations is considered in Chapter
4. We will also study predictive validity, which is involved whenever we use
tests, ratings, or personal judgments to select students for a special class or
a scholarship, whenever we group students, or whenever we help students make
decisions regarding subject or vocation choices. A test that might be quite valu- 
able for assessing student learning of a representative sampling of spelling words 
(allowing us to make sound inferences concerning student proficiency on the 
total list) might have little value in predicting whether students should take an 
enriched English course or choose a career in secretarial work. These and other 
concepts related to the validity of tests and other measures will be studied in 
Chapter 4. 


3 Obviously, a very large number of spelling tests could have been composed from
all possible combinations of 25 words from the population of 500. Once we have learned
how to determine, by methods described in Chapter 3, how well students’ scores on
one sample test agree with those on another sample, we can estimate the amount
of error variance in students’ scores and the standard error of measurement of a
student's score. It is sufficient to emphasize here that the standard error of an
average score for a large group of students is very small, in comparison with the
standard error for individual scores. For example, in a problem of this type, if a
group of 200-300 students averaged 80 percent spelling words correct, we could infer
that the students knew 79-81 percent of the 500 words in the spelling list sampled.

4 Throughout the textbook, the term "sample" refers to a group of test items
used or a group of individuals tested, while the term "universe" or "population"
designates the larger defined group of which the sample is supposed to be representative.




In Chapter 5, we will illustrate how we can apply in the process of test selec- 
tion the concepts developed in Chapters 2, 3, and 4, with respect to norms, 
reliability (errors in measurement), and validity. Chapter 5 will also consider 
the aids that the profession has developed to help us in this process of apprais- 
ing published tests. 




Interpreting Test Data 
in Terms of Converted Scores 


Before we consider the more complex problems involved in test construc- 
tion and test selection, let us consider the concepts involved in the inter- 
pretation of test data already collected. 

Students’ scores on a test (usually the number of items answered cor- 
rectly) have little meaning except to indicate the relative position of each 
student in the class on each section of the test. Such untreated scores are 
known as raw scores. A student's raw score on one section of a test is not 
directly comparable to his raw score on another section, which may have a 
larger or smaller number of items of greater or less difficulty. Before such 
scores can be used to appraise a student's relative strengths and weaknesses,
they must be expressed in comparable units. In other words, the
raw scores must be translated into converted scores, which show (1) how
a student's performance on the test compares with some arbitrary standard
(such as a perfect test score) or (2) how his score compares with
the scores of others in his class, his school, or some other group with whom
he can appropriately be compared.

After the necessary test construction and test administration had been 
completed, Mr. Smith and his staff had available to them: 


1. The results for six seventh-grade classes on the spelling test. 

2. Results for all eighth-graders at Central Junior High School on a locally 
devised arithmetic test, developed by the eighth-grade teachers of that 
school. 

3. The results for all eighth-grade students in the school district on both a 
locally developed state history test and a standardized United States history 
test. 


For the standardized United States history test, the publishers had developed
norms on students representative of the national population of
eighth-graders; hence each student's raw score could be interpreted in 
terms of percentile ranks (which will be explained in a later section). For 
the local tests, however, no such norms were available. 


CONVERTED SCORES BASED ON COMPARISON 
WITH A PERFECT SCORE 


The research committee on spelling had decided that there was no need 
for norms. Each student's score could be compared with a perfect score on 
the test. From a "percentage correct" score on the spelling test, one could 
infer the average percentage of seventh-grade spelling words that a pupil 
could spell. And this was the sole purpose of the spelling test. 

The city-wide committee on history also intended to translate raw scores
on the state history test into “percentage correct" scores. In fact, they had 
designed the final edition of the history test to have one hundred items, 
so that each student's raw score would be his “percentage correct" score. 

"Percentage correct" scores, however, are most meaningful when we are 
able to define a universe of learnings (as we did in spelling) and sample 
it in such a way that the test is representative of this defined universe.²
The ease with which the spelling committee had been able to take a ran- 
dom sampling of a defined universe of spelling words made the history 
committee members wish that all evaluation problems in education were 
as simple. 

Students’ “percentage correct" scores on the history test are comparable 
with those on another test, for example, the arithmetic test, only when two 
tests are equally difficult and when students’ scores show equal “scatter” 
or variation around the average. Actually, tests rarely meet these condi- 
tions unless great care has been taken in their construction. Hence teachers 
are not justified in drawing inferences about improvements or retrogres- 
sions in student achievement when "percentage correct" scores increase or 
decrease throughout the school year. 


"Percentage correct" scores provide no answers to such questions as
faced Mr. Smith and his teachers: 


1 The student will recall that construction of the spelling test had only required
the selection of every twentieth word from a defined universe of 500 seventh-grade 
spelling words and that from a student's “percentage-right” score, one could easily 
infer his probable level of achievement on the entire 500 words. 

2 Throughout the textbook, the term "sample" is used to refer to a group of test
items used, or a group of individuals tested, while the term "population" or
"universe" designates the larger defined group, of which the sample is supposed to
be representative.




1. Was the average score in the school (or in a class group) as high as it
should be?

2. On the average, was the school (or a class group) doing as well in their
knowledge of state history as they were in United States history? And how
did the students’ achievements in both areas of history compare with their
achievement in arithmetic?

3. Was a given individual doing as well in state history as he was in United
States history? or in arithmetic?


Answering their first question inevitably involves professional judgment. 
However, certain procedures provide information helpful in making such 
professional judgments: 


a. comparing the achievement of their students with those of eighth-graders
in the country as a whole (which could be done for the United
States history test if the sample on which the test was standardized
seemed sufficiently representative).³

b. comparing student achievement with some external criterion, for example,
checking to see how many students attained a level of arithmetic
achievement associated with satisfactory performance in general mathematics
or algebra courses.

c. having representative teachers rate each test question as "essential,"
“desirable,” and the like and then determining what percentage of
students answered these most significant questions accurately. Here
again, subjective judgment would be involved in deciding what level
of achievement was satisfactory.


The second and third questions could best be answered by developing
local norms. Such norms could be used to translate raw scores to converted
scores, which would be comparable from test to test. The running of an
adding machine tape had revealed the following average scores on the
three tests. The term “mean” is used in statistical work for this type of
average. Mean $= \frac{\Sigma X}{N}$, where $\Sigma$ is a symbol meaning “the sum of”; $X$ stands
for score, and $N$, for number of cases.

                                          MEAN (AVERAGE)    MEAN (AVERAGE)
                                          RAW SCORE         PERCENT CORRECT
Arithmetic test (100 items)                     84                84
State history test (100 items)                  78                78
United States history test (175 items)         109                62
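
The figures in the table can be checked with a short computation. The sketch below is illustrative only: the function names are ours, and the three scores passed to `mean` are hypothetical.

```python
def mean(scores):
    """Mean = (sum of the scores X) / (number of cases N)."""
    return sum(scores) / len(scores)


def percent_correct(raw_score, n_items):
    """Express a raw score as a percentage of a perfect score."""
    return 100 * raw_score / n_items


# (mean raw score, number of items) for each test, as given in the table.
tests = {
    "Arithmetic": (84, 100),
    "State history": (78, 100),
    "United States history": (109, 175),
}

mean_percent = {
    name: round(percent_correct(raw, items))
    for name, (raw, items) in tests.items()
}

# Hypothetical raw scores for three students, to show the mean formula itself.
example_mean = mean([80, 84, 88])
```

Only on a 100-item test does the raw score equal the percent-correct score; on the 175-item test a mean raw score of 109 corresponds to a mean of about 62 percent.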


Obviously, if a hypothetical student had made an average “percent correct”
score in each test, we could not infer that he had done best in arithmetic,
next best in state history, and least well in United States history. If a student
happened to earn these scores on the three tests, we would have to
say that he had done equally well on all three tests (if we used comparison
with other students as our basis of interpretation). All the remaining types
of converted scores, to be discussed in this chapter, involve such inter-individual
comparisons.

3 The factors that should be considered in attaining representativeness in norming
samples are considered on pp. 60-62.


CONVERTED SCORES BASED ON COMPARISONS 
AMONG EXAMINEES 


There are three main approaches to obtaining converted scores on the
basis of comparisons among individuals. These approaches, as summarized
in Table 2.7, include:


1. Comparison in terms of the difference between a student's score and the
group average or mean,⁴ this difference to be expressed in terms of a standard
unit (the standard deviation⁵ or some multiple thereof).

2. Comparison in terms of the rank of the student's score within the group of 
all students tested (or some defined reference group). 


3. Comparison in terms of the average age or grade status of students obtaining 
the same score. 


The third approach (age or grade norms) would be unsuitable for the 
state history test. It would be meaningless to interpret a student's score in 
eighth-grade history as average for a ninth-grader or tenth-grader. This 
history course is given only in eighth grade; hence, at the end of the eighth 
grade, students would earn a higher average score than they would if tested 
a year or two later at these higher grade levels.


Both the first and second types of norms, however, are suitable, and
each type has certain advantages. Percentile scores (the second type) are
more easily interpreted; but standard scores (the first type) have units of
equal size (that is, representing equal ranges in raw scores) throughout the
scale. Moreover, standard scores can be used in computing averages and
making other needed computations, whereas one cannot average percentile
scores.

4 The “mean” (M) is a term used to designate the type of average computed by
totaling all scores and dividing the sum by the number of cases.

5 The term “variability” refers to the extent to which scores are clustered closely
around the average or more widely dispersed. For example, the variability of all
high school students with respect to height would be greater than the variability for
either boys or girls, considered as separate groups. Of the frequently used measures
of variability, the standard deviation (SD) is the most stable and meaningful. (J. P.
Guilford, Fundamental Statistics in Psychology and Education, 3d ed. (New York:
McGraw-Hill Book Company, Inc., 1956), p. 99.) The SD can be computed for a
set of data by the following formula: $SD = \sqrt{\frac{\Sigma x^2}{N}}$, where $\Sigma$ stands for “sum of,”
$x$ = the difference between each raw score and the average, and $N$ = the number
of cases. An approximation formula for the SD will be used in this chapter.
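
The standard deviation formula given in the footnote (the square root of the mean of the squared deviations) can be sketched as follows. The five scores are hypothetical, the function name is ours, and the exact formula is shown rather than the approximation formula used in this chapter.

```python
import math


def standard_deviation(scores):
    """SD = sqrt(sum of squared deviations from the mean / number of cases)."""
    m = sum(scores) / len(scores)
    squared_deviations = [(score - m) ** 2 for score in scores]
    return math.sqrt(sum(squared_deviations) / len(scores))


# Hypothetical raw scores with a mean of 80.
scores = [70, 75, 80, 85, 90]
sd = standard_deviation(scores)  # deviations of -10, -5, 0, 5, 10 give sqrt(250 / 5)
```

Note that the sum of squared deviations is divided by N, the number of cases, as in the footnote's definition.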


Standard Scores 


Standard scores, although more difficult to understand than percentile 
scores, have many advantages. In computing a z-score, which is the basic 
type of standard score, one finds the difference (or deviation) between a 
student's raw score (X) and the average or mean (M) for his school (or 
other reference group). Then this deviation (x) is divided by the standard 
deviation (which constitutes a standard unit of measurement). The formula 
may be written in either of two ways:

$$z = \frac{X - M}{SD} \qquad \text{or} \qquad z = \frac{x}{SD}$$


That is, we first find how much the student's score falls above or below the 
mean. Then, if we are going to compare a student's performance on this
test with his performance on other tests, this difference must be expressed 
in terms of some standard unit. 

The need for some standard unit is apparent when we recognize that a
score ten points above the average on a 25-problem arithmetic test may
be the highest score, while a score ten points above the average on a 300-word
vocabulary test may represent an insignificant variation from the
average. Not only is length of test important but the extent of dispersion
or variability in student scores. If a test is very easy for a group (as a
300-word high school vocabulary test would be for college students), the
dispersion in scores may be small even though the test is long. That is,
almost all the students’ scores may be clustered close together.

Test specialists have found that the best procedure for obtaining scores
that are comparable from test to test is to divide each deviation score by
a measure of the dispersion or variability of student scores around the
average. The SD (standard deviation) has come to be preferred because
of its meaningfulness and its broad applicability.


THE STANDARD DEVIATION AS A UNIT OF MEASUREMENT The standard 
deviation is used as the common unit of measurement in comparing test 
data and other educational measurements. Such a unit is greatly needed in 
education because of the impossibility of establishing a zero point or a




22 THE EVALUATIVE PROCESS


maximum for such attributes as scholastic aptitude, competency in hand- 
writing, or social adjustment. Before we learn how to compute the standard 
deviation, we should examine the type of distribution frequently found 
when we graph the frequency (number of cases) for each score. 

When a large number of individuals, selected at random, are measured 
with respect to almost any dimension (height, shoulder width, scholastic 
aptitude, and the like), a graph showing the frequency, or number of cases, 
resembles the normal, or probability, curve; that is, the frequency distribu- 
tion is similar to the bell-shaped distribution shown in Figure 2.1. There 
tend to be relatively few cases with scores at each extreme and relatively 
large numbers of cases with scores near the average. 


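[Figure: normal curve. Horizontal axis marked at -3 SD, -2 SD, -1 SD, M,
+1 SD, +2 SD, and +3 SD (z = -3 to z = +3); vertical axis: Frequency or
Percent of Cases. The region between -1 SD and +1 SD includes the middle
68% of cases; the region beyond 1 SD on each side includes 16% of cases.]

Fig. 2.1 A Normal Distribution Curve Showing Mean and Standard
Deviation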


In a normal distribution, as shown in Figure 2.1, the middle 68 percent
of the cases fall between scores one SD below the M and one SD above
the M. Once the size of the SD is determined, any score can be
interpreted in terms of its deviation above or below the mean in units
of standard deviation.

For example, if the heights of 1000 seventh-grade boys were measured 
and the frequency (number of cases) for each height were graphed, the 
result would approximate a bell-shaped curve, called the normal curve. 
Graphing the frequencies for each height for 1000 ninth-grade boys would 
result in another approximately normal curve with a somewhat higher 
mean height and somewhat greater variability. The heights of all seventh- 
grade boys could be translated into z-scores, using the seventh grade mean 
and SD; while those for ninth-grade boys could be converted into z-scores 
on the basis of the ninth-grade mean and SD. Thus, it is possible to
compare a certain boy's height with those of others his own age when he is in
the seventh and later in the ninth grade. The comparison of such con- 
verted scores would reveal whether he maintained the same relative 
position in the two age groups or whether he was relatively shorter or taller 
in the ninth grade than he had been in the seventh grade. At each grade 
level the deviation of his height from the mean would be divided by the 
SD for that age group. 


AN ILLUSTRATION OF COMPUTED Z-SCORES (BASIC STANDARD SCORES) 
Since all readers are familiar with IQ's, they will be used as our first illus-
tration of the computation of z-scores. Here the mean (M) for the general
population is 100, and the standard deviation (SD) is 15 to 16 IQ points.
We will assume an SD of 15. Hence a student with an IQ of 115 has a
z-score of 1.0 because he is one SD above the mean. Similarly, a student 
with an IQ of 130 has a z-score of 2.0. 


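$$z = \frac{X - M}{SD} \qquad \text{or} \qquad z = \frac{x}{SD}, \qquad \text{where } x = X - M$$

When X = 130: $z = \frac{130 - 100}{15} = +2.0$

When X = 115: $z = \frac{115 - 100}{15} = +1.0$

When X = 100: $z = \frac{100 - 100}{15} = 0$

When X = 85: $z = \frac{85 - 100}{15} = -1.0$

When X = 70: $z = \frac{70 - 100}{15} = -2.0$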


Equal differences in z-scores correspond to equal differences in the raw
scores on which they are based.
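
The IQ illustration can also be checked with a few lines of code. The following is only a minimal sketch: the function name `z_score` and the constants are illustrative choices, and the SD of 15 is the value the text assumes.

```python
def z_score(raw, mean, sd):
    """Return the deviation of a raw score from the mean, in SD units."""
    return (raw - mean) / sd


IQ_MEAN = 100  # mean for the general population, as stated in the text
IQ_SD = 15     # the text assumes an SD of 15

# z-scores for the five IQ values used in the illustration
iq_z = {iq: z_score(iq, IQ_MEAN, IQ_SD) for iq in (130, 115, 100, 85, 70)}

for iq, z in iq_z.items():
    print(f"IQ {iq}: z = {z:+.1f}")
```

Each step of 15 IQ points changes the z-score by exactly 1.0, which is the sense in which equal raw-score differences give equal z-score differences.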


COMPUTATION OF Z-SCORES AND T-SCORES FOR THE LOCAL TESTS It is
evident that if we knew the SD's and means for the scores on the history,
arithmetic, and other tests, we could readily compute z-scores for all stu-
dents by this simple formula. Then these scores on the achievement tests
would be comparable with each other and with those on an intelligence
test.6


6 One could infer, for example, that a student with an IQ of 115 (z-score of
1.0) who had a z-score of approximately 1.0 on the history test or any of the
other tests was working approximately at ability level. Such a comparison of
z-scores (based on a sampling of the nation's population, as in the IQ test), with
z-scores based on locally developed tests would be valid only if the local IQ
distribution was such that the mean was approximately 100 and the SD approximately
15.




In the preliminary analysis of the data for Central High School, an
approximation formula for the SD will be adequate. The use of this simpli-
fied formula gives us an SD of 10 for scores on the state history test, given
in Table 2.1. The mean has already been computed as 78.0 by simply
totalling the scores and dividing the sum by N (the number of cases)
($M = \Sigma X / N$). The M and SD are then used to compute illustrative
z-scores, shown in Tables 2.3 and 2.4.


Table 2.1
Test Scores for Central High School Students in State History Test

RAW                            RAW
SCORE   FREQUENCY              SCORE   FREQUENCY
X       f        fX            X       f        fX
98      2        196           75      4
97      2        194           74      5
96      0                      73      4
95      1         95           72      3
94      2        188           71      3
93      2        186           70      3
92      2        184           69      5
91      3        273           68      3        204
90      3        270           67      2        134
89      2        178           66      3        198
88      1         88           65      3        195
87      3                      64      1         64
86      4                      63      2        126
85      3                      62      1         62
84      4                      61      0
83      4                      60      1         60
82      4                      59      2        118
81      5                      58      1         58
80      5                      57      0
79      4                      56      0
78      7                      55      0
77      4                      54      1         54
76      6

Sum of highest 1/6 of scores (scores 88 to 98) = 1852
Sum of lowest 1/6 of scores (scores 54 to 68) = 1273

$$\text{Standard deviation}^{*} = \frac{\text{Sum of high sixth} - \text{sum of low sixth}}{\text{Half the number of students}} = \frac{1852 - 1273}{60} = \frac{579}{60} = 9.65 \approx 10$$

* Formula taken from Paul B. Diederich, Short-Cut Statistics for Teacher-Made Tests, Evaluation
and Advisory Service Series No. 5 (Princeton, N. J.: Educational Testing Service, 1960), p. 21.
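
The short-cut SD can be sketched in code as follows. This is an illustrative sketch only: the frequencies are those read from Table 2.1, while the names `history_freq`, `expand`, and `shortcut_sd` are hypothetical helpers introduced here and are not part of the text or of Diederich's pamphlet.

```python
# Frequencies (score -> number of students) as read from Table 2.1.
history_freq = {
    98: 2, 97: 2, 95: 1, 94: 2, 93: 2, 92: 2, 91: 3, 90: 3, 89: 2, 88: 1,
    87: 3, 86: 4, 85: 3, 84: 4, 83: 4, 82: 4, 81: 5, 80: 5, 79: 4, 78: 7,
    77: 4, 76: 6, 75: 4, 74: 5, 73: 4, 72: 3, 71: 3, 70: 3, 69: 5, 68: 3,
    67: 2, 66: 3, 65: 3, 64: 1, 63: 2, 62: 1, 60: 1, 59: 2, 58: 1, 54: 1,
}


def expand(freq):
    """Expand a frequency table into a sorted list of individual scores."""
    return sorted(score for score, f in freq.items() for _ in range(f))


def shortcut_sd(scores):
    """Approximate SD: (sum of high sixth - sum of low sixth) / (N / 2)."""
    ordered = sorted(scores)
    n = len(ordered)
    k = round(n / 6)               # number of scores in one sixth of the group
    low_sum = sum(ordered[:k])     # sum of the lowest sixth
    high_sum = sum(ordered[-k:])   # sum of the highest sixth
    return (high_sum - low_sum) / (n / 2)


scores = expand(history_freq)
mean = sum(scores) / len(scores)
sd_estimate = shortcut_sd(scores)
print(len(scores), round(mean, 1), round(sd_estimate, 2))
```

With 120 students, one sixth of the group is 20 scores, so the two sums are taken over the 20 highest and the 20 lowest scores.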




Table 2.2
Frequency Distribution for Central High School Students in State History
Test and Computation Guide for Obtaining the Mean by the Short Method

SCORE
INTERVAL(a)    f      d      fd
96-98          3     +6     +18
93-95          5     +5     +25
90-92          8     +4     +32
87-89          7     +3     +21
84-86         11     +2     +22
81-83         13     +1     +13
78-80         16      0    (+131)(b)
75-77         14     -1     -14
72-74         12     -2     -24
69-71         11     -3     -33
66-68          8     -4     -32
63-65          6     -5     -30
60-62          2     -6     -12
57-59          3     -7     -21
54-56          1     -8      -8
          N = 120          (-174)(b)
                     Σfd =  -43

DIRECTIONS FOR COMPUTING MEAN FROM GROUPED DATA

1. Choose any interval as an arbitrary origin. Here the interval 78-80 has
   been chosen.
2. Assign a d value to each interval (in terms of the number of intervals it
   lies above or below the arbitrary origin).
3. Then, in each row, multiply the entries in the f and d columns, entering
   products in the fd column.
4. Add the fd column to obtain Σfd (Σ is a symbol for "the sum of").
5. Obtain the correction (in intervals) by dividing Σfd by N (number of
   cases).
6. Multiply the correction by i (size of interval) to obtain the correction in
   score points. Then add it algebraically to the assumed mean (the midpoint
   of the interval selected as arbitrary origin).

Assumed Mean (AM) = midpoint of 78-80 interval = 79

$$\text{Correction } (c) = \frac{\Sigma fd}{N} = \frac{-43}{120} = -0.36$$

$$\text{Mean } (M) = AM + (c)(i) = 79.00 + (-0.36)(3) = 79.00 - 1.08 = 77.92 \approx 78$$

(a) In order to estimate the size of interval (i) to be used for tallying test scores, one can
divide the range of scores by 15. In this case, the range of scores was from 54 to 98, or 44
points. The size of interval was 44/15, or 3. Division by 15 is recommended since at least 15
intervals are desirable. A smaller number of intervals usually involves too much loss of information
about the distribution of scores and increases the error in computation that results from not
using precise score values. Ordinarily, intervals of 3, 5, 10, or multiples of 10 make for ease
in tallying.

(b) Partial sums of positive and negative fd values.
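
The six directions for the short method can be sketched in code. This is a minimal illustration under stated assumptions: the grouped frequencies are those of Table 2.2, and the function name `short_method_mean` and its arguments are hypothetical choices made for this sketch.

```python
# Grouped data from Table 2.2, highest interval first:
# (lowest score in interval, highest score in interval, frequency).
history_grouped = [
    (96, 98, 3), (93, 95, 5), (90, 92, 8), (87, 89, 7), (84, 86, 11),
    (81, 83, 13), (78, 80, 16), (75, 77, 14), (72, 74, 12), (69, 71, 11),
    (66, 68, 8), (63, 65, 6), (60, 62, 2), (57, 59, 3), (54, 56, 1),
]


def short_method_mean(intervals, origin_index):
    """Mean from grouped data, using the interval at origin_index as origin."""
    low, high, _ = intervals[origin_index]
    assumed_mean = (low + high) / 2        # midpoint of the origin interval
    size = high - low + 1                  # i, the size of interval
    n = sum(f for _, _, f in intervals)    # N, the number of cases
    # d is the number of intervals above (+) or below (-) the origin;
    # the list runs from the highest interval down, so d = origin_index - idx.
    sum_fd = sum(f * (origin_index - idx)
                 for idx, (_, _, f) in enumerate(intervals))
    correction = sum_fd / n                # correction, in intervals
    return assumed_mean + correction * size, sum_fd, n


mean_78_80, sum_fd, n = short_method_mean(history_grouped, origin_index=6)
mean_96_98, _, _ = short_method_mean(history_grouped, origin_index=0)
print(n, sum_fd, round(mean_78_80, 2), round(mean_96_98, 2))
```

Choosing a different arbitrary origin (here the 96-98 interval) changes Σfd but not the resulting mean, which is why step 1 allows any interval to be chosen. The unrounded result is 77.925; the table's 77.92 comes from rounding the correction to -0.36 first.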




Table 2.3
Frequency Distribution for Central High School Students
in an Arithmetic Test

SCORE
INTERVAL       f      d      fd
96-98         21     +5    +105
93-95         18     +4     +72
90-92         12     +3     +36
87-89         10     +2     +20
84-86          8     +1      +8
81-83          7      0    (+241)
78-80          8     -1      -8
75-77          6     -2     -12
72-74          8     -3     -24
69-71          5     -4     -20
66-68          4     -5     -20
63-65          4     -6     -24
60-62          3     -7     -21
57-59          3     -8     -24
54-56          3     -9     -27
          N = 120          (-180)
                     Σfd =  +61

$$M = AM + (c)(i) = 82 + (.51)(3) = 83.53 \approx 84$$

$$\text{Stand. Dev. } (SD) = \frac{\text{Sum of high sixth} - \text{sum of low sixth}}{\text{Half the number of students}} = \frac{1943 - 1260}{60} = \frac{683}{60} = 11.4$$

Examples of computation of standard scores for skewed distribution

$$z = \frac{\text{Raw score} - \text{Mean}}{\text{Standard deviation}}, \qquad M = 84, \quad SD = 11$$

Highest score      X = 98    z = (98 - 84)/11 =  14/11 =  1.3
An average score   X = 84    z = (84 - 84)/11 =   0/11 =  0
Lowest score       X = 54    z = (54 - 84)/11 = -30/11 = -2.7

Other examples     X = 90    z = (90 - 84)/11 =   6/11 =  0.5
                   X = 80    z = (80 - 84)/11 =  -4/11 = -0.4
                   X = 70    z = (70 - 84)/11 = -14/11 = -1.3
                   X = 60    z = (60 - 84)/11 = -24/11 = -2.2




The distribution of history scores in Table 2.2 is highly symmetrical and 
approaches a normal curve, while the distribution of arithmetic scores in 
Table 2.3 is skewed, or not symmetrical. Students' scores are piled up at 
the high-score end of the distribution, with more than half the scores in the 
top four intervals. In both the normal and the skewed distribution, differ- 
ences in z-scores faithfully reflect proportional differences in raw scores, 
as shown in Table 2.4. 


Table 2.4
Comparison of z-scores and T-scores for Students with Identical
Raw Scores on Arithmetic and State History Tests

            Raw scores(a)        z-scores(b)         T-scores(b)
                     STATE               STATE               STATE
            ARITH.   HISTORY    ARITH.   HISTORY    ARITH.   HISTORY
James         90       90        0.5      1.2         55       62
Mary          80       80       -0.4     +0.2         46       52
Sandra        70       70       -1.3     -0.8         37       42
John          60       60       -2.2     -1.8         28       32

(a) Note that although both these tests are 100-item tests, identical raw scores on the two
tests are translated into different z-scores and T-scores because of differences in the difficulty
of the tests and slight differences in the SD values, which reflect the degree to which scores
are dispersed around the mean. For example, a raw score of 90 in the state history test repre-
sents greater superiority in comparison with other students than does a score of 90 on the
arithmetic test.

(b) Equal differences in raw scores are accurately reflected in equal differences in standard
scores, that is, Mary and James differ by 10 raw score points in arithmetic; so also do Sandra
and John. The difference in their z-scores in arithmetic is 0.9 for each pair; the difference in
their T-scores is 9 points for each pair. The formula for the T-score is as follows: T = 10z + 50.


The z-scores, although easy to compute, have the disadvantage of involving
decimal points and minus signs. Hence, it may be desirable to translate
z-scores into equivalent T-scores,7 which avoid these two problems. For
T-scores, the mean on any test is equated to 50 and the SD, to 10.
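
The conversions in Table 2.4 can be reproduced with a short sketch. The means and SDs (84 and 11 for arithmetic, 78 and 10 for state history) are the rounded values used in the text; the names `TESTS`, `RAW`, `z_score`, and `t_score` are illustrative only.

```python
def z_score(raw, mean, sd):
    """Deviation of a raw score from the mean, in SD units."""
    return (raw - mean) / sd


def t_score(z):
    """T-score: a z-score rescaled to a mean of 50 and an SD of 10."""
    return 10 * z + 50


# (mean, SD) for each test, using the rounded values given in the text
TESTS = {"arithmetic": (84, 11), "history": (78, 10)}
RAW = {"James": 90, "Mary": 80, "Sandra": 70, "John": 60}

table = {}
for name, raw in RAW.items():
    for test, (mean, sd) in TESTS.items():
        z = z_score(raw, mean, sd)
        # z reported to one decimal, T to the nearest whole number
        table[(name, test)] = (round(z, 1), round(t_score(z)))

for key, (z, t) in table.items():
    print(key, z, t)
```

Because T = 10z + 50 is a linear rescaling, a z-score difference of 0.9 always appears as a T-score difference of 9 points.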

Since we will later be working with large numbers of cases, the students 
may wish to learn a method of computing the mean from grouped data. In 
Table 2.2, the short method of computing the mean from grouped data 
is illustrated for the history test. In each case, the test data have been 


7 T-score = 10z + 50.




tallied by an interval of 3. This method gives almost identical results8 to
those obtained by adding original scores and dividing the total by the 
number of cases. 


Converted Scores Based on Student's Rank within Group 


We will now consider types of converted scores, based on the student's 
rank within a group. It was obviously of limited value for the city-wide 
history committee to compute “rank in class" for each student, since classes 
varied in size, or to compute “rank in school,” since the size of eighth- 
grade classes varied from 80 to 170 in the different junior high schools. To 
rank fortieth in the smallest school would be to rank close to the average; 


to rank fortieth in the largest school would be to be in the top one-fourth 
of the group. 


PERCENTILE SCORES A conference with the research director convinced 
the history committee that the use of percentile scores was advisable. A 
simple graphic procedure would make it possible to find the percentile 
scores corresponding to each raw score. A student with a median score
would have a percentile score of 50, since his raw score exceeded those of 
50 percent of the group; a student would have a percentile score of 90 if 
his raw score exceeded those for 90 percent of the group. 

Mr. Smith decided to experiment with these procedures with the data
for Central High School before he applied them to the city-wide results.
He computed the cumulative frequency and cumulative percentage for each
interval, as shown in Table 2.5. He then used a formula to compute the
score value for P40, the score that exceeded those for 40 percent of the
group. The work involved to obtain only one of the 99 percentile scores
by formula convinced him that it was best to shift to the graphic method.

In Figure 2.2, the cumulative percentage for each score interval is
plotted in a curve, called the ogive. From this graph, the percentile score
corresponding to each raw score can be read. The value read from the
graph for P40 is similar to the computed percentile score of 75.6. In this
graph and its footnote, the procedures for obtaining a percentile score for

method in Table 2.2 being 77.9. This small error is introduced by grouping the data 
rather than by using the "short method," which simply reduces the amount of 
computation. 




Table 2.5
Cumulative Frequencies and Cumulative Percentages Used in Graphing
Ogives for State History Test and Local Arithmetic Test

                   State history test                    Arithmetic test
             FRE-     CUMULATIVE   CUMULATIVE      FRE-     CUMULATIVE   CUMULATIVE
SCORE        QUENCY   FREQUENCY    PERCENT         QUENCY   FREQUENCY    PERCENT
INTERVALS    f        f_c                          f        f_c
96-98          3        120         100.0            21       120          100
93-95          5        117          97.4            18        99           83
90-92          8        112          93.3            12        81           68
87-89          7        104          86.6            10        69           58
84-86         11         97          80.8             8        59           49
81-83         13         86          71.7             7        51           43
78-80         16         73          60.8             8        44           37
75-77         14         57          47.5             6        36           30
72-74         12         43          35.8             8        30           25
69-71         11         31          25.8             5        22           18
66-68          8         20          16.7             4        17           14
63-65          6         12          10.0             4        13           11
60-62          2          6           5.0             3         9            7
57-59          3          4           3.3             3         6            5
54-56          1          1           0.8             3         3            2
             120                                    120

Directions for Computing Percentile Points

The percentile point, below which falls any specified percentage of scores, may be
obtained either by use of a graph (as in Figures 2.2 or 2.6) or by use of the fol-
lowing formula:

$$P = LL + \left(\frac{FN - f_c}{f_p}\right) \times i$$

Where

LL  = lower limit of interval containing the desired percentile point
      (always 1/2 unit below the score limit of interval)
F   = a fraction that varies with the percentile desired; for the
      Mdn or P50 (the point below which 50 percent of the cases
      fall), F is 50/100; for P40, F is 40/100, and the like.
N   = number of cases (or test scores)
f_c = cumulative frequency below the interval containing the de-
      sired percentile point
f_p = number of cases (or test scores) within the interval contain-
      ing the desired percentile point
i   = size of interval (in this distribution i = 3)

EXAMPLE If we are computing P40, we first find FN, which is 40/100 × 120 = 48.
Note that in the f_c column for the state history test, there are 43 cases below the
interval (75-77) that contains P40; therefore f_c = 43 and f_p = 14.

$$P_{40} = 74.5 + 3\left[\frac{48 - 43}{14}\right] = 74.5 + 3\left(\frac{5}{14}\right) = 74.5 + 1.1 = 75.6$$
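
The percentile-point formula can be sketched as a small function. This is an illustration under stated assumptions: the frequencies are those of the state history test in Table 2.5, and the name `percentile_point` and its arguments are hypothetical choices for this sketch.

```python
# State history test from Table 2.5, lowest interval first:
# (lowest score in interval, frequency). Interval size i = 3.
history_intervals = [
    (54, 1), (57, 3), (60, 2), (63, 6), (66, 8), (69, 11), (72, 12), (75, 14),
    (78, 16), (81, 13), (84, 11), (87, 7), (90, 8), (93, 5), (96, 3),
]


def percentile_point(intervals, pct, size=3):
    """Score below which pct percent of cases fall: LL + ((FN - f_c) / f_p) * i."""
    n = sum(f for _, f in intervals)
    target = pct * n / 100                 # FN
    cum_below = 0                          # f_c, cases below the current interval
    for lowest_score, f in intervals:
        if f > 0 and cum_below + f >= target:
            lower_limit = lowest_score - 0.5   # LL, 1/2 unit below the score limit
            return lower_limit + size * (target - cum_below) / f
        cum_below += f
    raise ValueError("pct must be between 0 and 100")


p40 = percentile_point(history_intervals, 40)
p10 = percentile_point(history_intervals, 10)
print(round(p40, 1), p10)
```

For P40 the function locates the 75-77 interval (43 cases below it, 14 within it), which mirrors the worked example above.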


a given raw score (X = 70), and for obtaining the raw score for a speci-
fied percentile score (P75) are illustrated.

Percentile scores have many advantages. They are easily obtained by
the graphic method. They can be easily interpreted to students and parents,
without bringing in the concept of standard deviation, which is more dif-
ficult to understand. Percentile scores, however, have certain disadvantages.
As a measure of growth, they can be misleading; for example, a student
who has a percentile score of 60 at the beginning of the seventh grade, and
also at the beginning of the eighth grade, has made normal progress. In
each case, he is being compared with students of his own grade level; and
he has maintained the same relative status within his grade level group.
Another disadvantage of percentile scores can be noted from an examina-
tion of Figure 2.2. A difference of 20 percentile points near the middle of
the distribution represents a relatively small difference in raw scores,
whereas a difference of 20 percentile points near either extreme represents
a much larger difference in raw scores.9 This characteristic of percentile
scores must be taken into account in interpreting test data; or one would
tend to overestimate the si[...] (both indicating achievement
near the average) [...] represents a much smaller variation in
achievement than the difference between P[...] [...] are
marked differences in height corresponding to a difference of ten per-
centile points. For example, the tallest man would be conspicuously taller

9 For example, in Figure 2.2, the difference of 20 percentile points between P40
and P60 represents a range of only 6 raw scores from 75 to 81; while a difference
of 20 percentile points from P80 to P100 represents a difference of 12 raw score points,
that is, from 86 to 98.


[Figure: ogive. Vertical axis: Cumulative Percentage (0 to 100); horizontal
axis: Score on State History Test.]

Fig. 2.2 Ogive, or Cumulated Percentage Curve, for Scores on
State History Test (based on cumulative percent column in Table
2.5)


NOTE: To estimate the percentile equivalent of a raw score, take
a card, position the right edge of the card at the raw-score value
and perpendicular to the base line of the graph. Move the card
up or down until the horizontal edge intersects the curve. With the
card properly positioned, read the desired percentile point on the
percentile scale. In this case, with the right edge perpendicular to
the base line at X = 70, the card would intersect with the curve
at a point corresponding to a percentile point of 21. To find the
score value corresponding to any percentile point, position the
upper edge of the card at the percentage value desired. Move
the card over until the right edge intersects the curve. Then with the
card properly positioned, follow down to the base line and read
off the score value. In this case P75 corresponds to a score value of
approximately 84.5. Note that the value of the cumulative per-
centage is plotted at the upper limit of each score interval and
that the lower and upper limits of each interval are respectively at
1/2 unit below and above the score points shown.
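
The card-and-graph readings described in the note can be approximated numerically. The sketch below assumes the ogive is drawn as straight segments between the plotted points, which a hand-drawn curve need not be; the function names are hypothetical, and the frequencies are those of the state history test in Table 2.5.

```python
# State history test from Table 2.5, lowest interval first:
# (lowest score in interval, frequency). Interval size i = 3.
history_intervals = [
    (54, 1), (57, 3), (60, 2), (63, 6), (66, 8), (69, 11), (72, 12), (75, 14),
    (78, 16), (81, 13), (84, 11), (87, 7), (90, 8), (93, 5), (96, 3),
]


def ogive_points(intervals, size=3):
    """(upper limit of interval, cumulative percent) pairs, starting at 0 percent."""
    n = sum(f for _, f in intervals)
    points = [(intervals[0][0] - 0.5, 0.0)]   # lower limit of the lowest interval
    cum = 0
    for lowest_score, f in intervals:
        cum += f
        points.append((lowest_score - 0.5 + size, 100 * cum / n))
    return points


def interpolate(points, x):
    """Read y at x along straight segments joining the (x, y) points."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("value lies outside the plotted range")


ogive = ogive_points(history_intervals)
pr_of_70 = interpolate(ogive, 70)                           # raw score -> percentile
score_at_p75 = interpolate([(y, x) for x, y in ogive], 75)  # percentile -> raw score
print(round(pr_of_70, 1), round(score_at_p75, 1))
```

Under this straight-segment assumption, a raw score of 70 corresponds to a percentile of about 21 and P75 to a score of about 84.6, close to the values read from the graph in the note.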




than the man ranking at the 90th percentile, and the shortest would be
markedly shorter than the man at the 10th percentile. For further clarifica-
tion of this concept, see Figure A.1 in Appendix D and the accompanying
explanation.

Mr. Smith also realized that he could not average percentile scores 
directly. To obtain an average percentile score for two or more individuals, 
one must first obtain the raw score equivalents of the percentile scores, 
average them, and find the equivalent percentile score for the average 
raw score. 

Up to this point, we have used the term “percentile score,” but the 
student will find many test manuals using the term “percentile rank.” The 
term “percentile rank” has a somewhat different meaning than the term 
“percentile point” computed in Table 2.5. The directions given in that 
table enable one to compute each of 99 percentile points, which divide the 
frequency distribution into one hundred groups containing equal numbers 
of cases. The “percentile rank” is a converted score for a span of raw 
score values, centering around the percentile point; a PR of 2 is the con-
verted score for raw score values centered around P2. The following chart
illustrates the difference between the two terms and also shows why the 
terms 99+ or 1— appear in norms tables of test manuals. 


[Chart: the percentile points at the two extremes of the distribution (P1, P2,
P3 and P97, P98, P99) mark off successive segments each containing 1% of the
cases (lowest 1% of cases, next lowest 1% of cases, and so on up to the highest
1% of cases). The percentile ranks shown against them are 1-, 1, 2 at the
lower extreme and 98, 99, 99+ at the upper extreme.]

Graphic explanation of percentile points and percentile ranks at
the upper and lower extremes of a frequency distribution


When we are dealing with the division of a range of raw scores into 
one hundred parts, the distinction between percentile points and percentile 
ranks is a minor distinction. However, when we are dealing with the 
division of a distribution into tenths, the distinction becomes very im- 
portant. For example, the raw scores that would correspond to a decile 
rank of 1 for the state history test would be scores that would include 
5 percent of the cases on both sides of P10 or D1. Although decile ranks
are infrequently used, the example is included to help the student distin-
guish between these two concepts.10

10 In Table 2.5, P10 = 65.5. Decile 1 would include scores from 63 to 68. See
Howard B. Lyman, Test Scores and What They Mean (Englewood Cliffs, N. J.:
Prentice-Hall, Inc., 1963), p. 118.


[Chart: decile points (D1, D2, and so on) and the corresponding decile ranks
at the lower and upper extremes of a frequency distribution.]


NORMALIZED STANDARD SCORES In an earlier section of this chapter, 
we used the SD as a unit for expressing test scores in a common frame of 
reference, which compensates for differences in variability between dis- 
tributions of test scores. We have shown how to compute z-scores and 
T-scores so that student scores on different tests are expressed in com- 
parable terms, that is, in terms of how much these scores differ from the 
mean (in SD units). We have obtained percentile equivalents of raw scores 
by computation and through the graphic method. 

We would now like to show how percentile ranks and normalized stand- 
ard scores can easily be obtained through the use of the Otis Normal 
Percentile Chart, if the distribution of test scores approaches normality. 
First, it is desirable to examine some of the characteristics of the "normal
curve." (See Fig. 2.1.)


1. Characteristics of the Normal Curve. The normal probability curve or nor- 
mal curve is a theoretical curve. However, the frequency curves for many 
attributes (which are affected by a multitude of interacting factors) ap- 
proximate the normal curve. For example, many biological characteristics, 
such as height, width of shoulders, and the like are normally distributed, 
probably because of the myriad possibilities of different chance combinations
of genes affecting these characteristics. Also many distributions of test scores
approximate a normal distribution, especially when a large number of per-
sons have been administered a test designed at an optimum level of diffi-
culty.11 And, as we shall see in the next chapter, a large number of repeated
measurements of any characteristic, such as an individual's scores on re-
peated test samples of vocabulary, tend to be normally distributed, because
of chance combinations of different measurement errors.

The fact that the normal curve is a theoretical curve that can be exactly 
described once the mean and SD are known, makes it exceedingly useful in 
the development of comparable scores for many diverse variables. For 
example, if such comparisons were useful, one could compute comparable 
standard scores on such diverse variables as height, speed of running, 
knowledge of vocabulary, and attitudes toward school. That is, one can
define each individual's place on a normal distribution for a specified 
population and determine how much more or less he differs from the aver- 
age in each variable. 


11 Optimum level of difficulty is discussed further in Chapter 5. For most pur- 
poses, the best level of difficulty is one which allows for maximum differentiation 
among the persons tested. This maximum differentiation is achieved when the 
average "number correct" is halfway between a chance score (for example, one 
half the number of items in a true-false or other two-choice test) and the maximum
possible score.


Since the normal curve is so significant, we should note some of its major 
distinguishing characteristics: 


a. The largest number of cases are clustered in the center of the range, 
with the highest frequency at the exact center; thus the modal (most 
frequent) score is the same as the mean score, and also the same as 
the median (or midscore). A perpendicular line erected at the mid- 
point divides the area under the curve (and the number of cases it 
represents) in half.

b. The curve is symmetrical, each half of the curve being the mirror 
image of the other. 

c. The shape of each half of the curve changes from convex to concave 
at a point 1 SD above and below the mean. About 68 percent of the 
area (and the cases) lie within one SD (plus and minus) of the mean. 

d. Tables exist that give much valuable information about the character- 
istics of the normal curve. That is, once we make the assumption that 
our data are approximately normally distributed, we can make a 
number of helpful inferences. For example, there is a definite relation- 
ship between (1) any z-score, and (2) the proportion of the area (or 
cases) which that z-score exceeds. 


As an aid in conceptualizing the way in which “area under the normal 
curve” represents number, or proportion, of cases, the student is referred 
to Figure 2.3. In this histogram,12 each case is represented by a small rec-
tangle; hence it is apparent that the proportion of area represents the pro-
portion of cases. The reader will note that since the distribution of history
scores is approximately normal, the proportion of area (and cases) to the
left of -1 SD (or to the right of +1 SD) is approximately the same as that
found in a normal distribution.

From the left-hand side of Table A.1 in Appendix D we find that in a nor- 
mal distribution a z-score of 0 exceeds 50 percent of the area (or cases); a
z-score of 1 exceeds 84 percent; a z-score of 2 exceeds 98 percent; and the 
like. In the right half of the table, we find that a z-score of — 1 exceeds 16 
percent of the cases and a z-score of — 2 exceeds 2 percent of the cases. 
Any other intermediate values for z-scores can easily be changed into per- 
centile scores and vice versa. Examination of Figure 2.3 will show why these 
z values are equal to these percentile scores. 
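
In place of a printed table of areas, the same relationship can be sketched with the normal cumulative distribution function. This is an illustration only, not a substitute for Table A.1: the function name `percent_exceeded` is a hypothetical choice, and the computation uses the standard identity for the normal curve in terms of the error function.

```python
import math


def percent_exceeded(z):
    """Percent of area (or cases) in a normal distribution that a z-score exceeds."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Percentages for the z-scores discussed in the text, to the nearest percent
exceeded = {z: round(percent_exceeded(z)) for z in (-2, -1, 0, 1, 2)}
print(exceeded)
```

Intermediate z-scores can be converted in the same way, for example `percent_exceeded(0.5)`.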

With Figure 2.4, on the other hand, in which the frequency distribution 
of scores is not symmetrical, but is skewed or asymmetrical, the percentages 
of cases in these same segments of the frequency polygon13 (below -1 SD


and above + 1 SD) do not approximate as closely the percentages in the 
normal curve. 


12 In a histogram, there is erected at each score interval a vertical bar representing
the number of cases in that interval. 

13 A frequency polygon is a simpler form than the histogram for showing the 
form of a frequency distribution. In the frequency polygon, the frequency for each 
interval is represented by a point plotted over the midpoint of that interval. The 
polygon is completed by connecting adjacent points with straight lines. At each 
extreme, the frequency curve is dropped to the baseline at the midpoint of the 
interval just above (or below) that containing the highest (or lowest) scores,
respectively.


[Figure: histogram. Horizontal axis: History Test Scores, in intervals of 3
from 54-56 to 96-98, with M - 1 SD = 68 (z = -1.0), M = 78, and
M + 1 SD = 88 (z = +1.0) marked; vertical axis: Frequency. The areas below
-1 SD, between -1 SD and +1 SD, and above +1 SD are labeled (1), (2),
and (3).]

(1) 19 cases(a), or 16 percent of cases and 16 percent of area, are below 68, or a
    z-score of -1 in this distribution
    16 percent of the cases and 16 percent of the area are below -1 SD in a normal
    distribution

(2) 81.5 cases, or 68 percent of cases and 68 percent of area, are between 68 and
    88, or between z-scores of +1 and -1 in this distribution
    68 percent of the cases and 68 percent of the area are between +1 SD and
    -1 SD in a normal distribution

(3) 19.5 cases(b), or 16 percent of cases and 16 percent of area, are above 88 or a
    z-score of +1 in this distribution
    16 percent of the cases and 16 percent of the area are above +1 SD in a normal
    distribution

Fig. 2.3 Histogram for Distribution of Central High School Scores
in the State History Test (designed to illustrate the concept that any
specified area under a frequency curve represents the frequency,
or percentage of cases, within its score boundaries).

(a) Since 68 is 1/6 interval below the upper limit of the interval 66-68 (or 65.5-68.5),
only 5/6 of the cases in the interval were included.

(b) Since 88 is the midpoint of the 87-89 interval, one half of the cases in the 87-89
interval were included.


2. Linear and Area Transformations of Test Scores. The z-scores that we com-
puted by formula, and the T-scores derived from them, are linear standard
scores obtained through linear transformations (multiplying the original
scores by a constant, and/or adding a constant to them). That is, the raw
scores are translated by a formula to converted scores that preserve the
original relationships between the raw scores. Equal differences in converted
scores are proportional to equal differences in raw scores. The information
about differences in performance between students is preserved. If raw
scores are plotted against their converted z-scores or T-scores, the points
representing the pairs of raw and converted scores fall along a straight line
(or show a linear relationship) (Figure 2.5).


[Figure: frequency polygon. Horizontal axis: Arithmetic Test Scores, in
intervals of 3 from 54-56 to 96-98, with z = -1 (score of 72), M, and
z = +1 (score of 94.5) marked; vertical axis: Frequency. The areas below
z = -1 and above z = +1 are labeled (1) and (2).]

(1) 23 cases(a), or 19 percent of the cases, are below 72, or a z-score of -1 in this
    distribution
    16 percent of the cases are below a z-score of -1 in a normal distribution

(2) 27 cases(b), or 22.5 percent of the cases, are above 94.5, or a z-score of +1 in
    this distribution
    16 percent of the cases are above a z-score of +1 in a normal distribution

Fig. 2.4 Frequency Polygon for Distribution of Central High School
Scores in Arithmetic Test (designed to show skewed distribution
obtained when test is too easy to discriminate adequately among
high-achieving students).

(a) Since 72 is 1/6 interval above the lower limit of the interval 72-74 (71.5-74.5),
1/6 of the cases in this interval were included.

(b) Since 94.5 is 1/3 interval below the upper limit of the interval 93-95 (92.5-95.5),
1/3 of the cases in this interval were included.


In converting raw scores to normalized standard scores, however, the
procedure is quite different. Raw scores are first converted into PR's
(representing the proportion of cases or area exceeded); and these PR's,
in turn, are converted into equivalent standard scores in a normal
distribution. For example, in the following table we show the PR's for three
raw scores on the state history test.


              PERCENTILE       NORMALIZED        LINEAR
              RANK             z-SCORES          z-SCORES
              (FROM            (OBTAINED FROM    (OBTAINED BY FORMULA
RAW SCORE     FIGURE 2.2)      TABLE A.1)        TRANSFORMATION)
   70            21              −0.8              −0.8
   76            40              −0.25             −0.2
   85            75              +0.7              +0.7


If we use Table A.1 (based on percentile ranks or areas under the normal
curve corresponding to different z-score values), we obtain the normalized
standard scores shown in the third column. If we obtain z-scores by
means of the formula $z = \frac{X - M}{SD}$, they are almost identical (as
shown in the last column). The differences are negligible in this case because
the frequency distribution of scores on the local state history test closely
approximates a normal distribution.
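
The table lookup that converts a percentile rank into a normalized z-score can
be imitated with the inverse of the normal cumulative distribution. The sketch
below uses Python's standard library in place of Table A.1; the function name
is an assumption made for the example, and the three percentile ranks are
those given in the table above.

```python
from statistics import NormalDist

def normalized_z(percentile_rank):
    """z-score in a normal distribution that exceeds the given percent of cases."""
    # percentile_rank must lie strictly between 0 and 100
    return NormalDist().inv_cdf(percentile_rank / 100)

# Raw scores and percentile ranks from the state history test example.
for raw, pr in [(70, 21), (76, 40), (85, 75)]:
    print(f"raw {raw}   PR {pr}   normalized z {normalized_z(pr):+.2f}")
```

The results agree with the third column of the table to within the rounding
used there.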


Fig. 2.5 Illustration of Linear (straight-line) Relationship between
Raw Scores and Their Linear Transformations (z-scores and T-scores).
State History Test Data from Table 2.6. [Horizontal axis: Raw Scores.
Vertical axes: T-Scores and z-Scores.]


Fig. 2.6 Use of the Normal Percentile Chart in Obtaining Normalized
Standard Scores: Data for All Schools. [Chart printed sideways in the
original: the cumulated percentages for each score interval of the state
history test are plotted on an Otis Normal Percentile Chart, with percentile
ranks on the top scale and z-scores on the bottom scale.]

Normal Percentile Chart by Arthur S. Otis. Copyright by Harcourt, Brace &
World, Inc., New York, N. Y. Copyright in Great Britain. All rights reserved.
Reproduced by special permission.


Table 2.6
Percentile Ranks, T-scores, and Stanine Scores for the State History Test
for All Schools (as read from the Normal Percentile Chart Figure 2.6)a

             PERCENTILE  T-     STANINE               PERCENTILE  T-     STANINE
SCORE        RANK        SCORE  SCORE    SCORE        RANK        SCORE  SCORE
45 & below       1        27      1      72              43        48      5
46-48            2        30      1      73              47        49      5
49-50            3        31      1      74              50        50      5
51-52            4        33      1      75              54        51      5
53               5        34      2      76              56        52      5
54               6        35      2      77              59        52      5
55               7        35      2      78              62        53      6
56               8        36      2      79              65        54      6
57               9        37      2      80              68        55      6
58              11        37      2      81              71        56      6
59              12        38      3      82              74        57      6
60              14        39      3      83              77        58      6
61              16        40      3      84              79        58      7
62              18        41      3      85              81        59      7
63              20        42      3      86              83        60      7
64              22        43      3      87              85        61      7
65              24        43      4      88              87        62      7
66              27        44      4      89              88        62      7
67              30        45      4      90              90        63      8
68              32        46      4      91              91        64      8
69              34        46      4      92              92        64      8
70              37        47      4      93              93        65      8
71              40        48      4      94              94        66      8
                                         95              95        66      8
                                         96              96        68      8
                                         97 & above      99        73      9


a If a normal percentile chart is not available, normal probability paper can
be used to obtain T-scores from a graph (with raw scores being plotted on the
horizontal axis and cumulative proportions or percentages on the vertical
axis). See J. P. Guilford, Fundamental Statistics in Psychology, third ed.
(New York: McGraw-Hill Book Company, Inc., 1956), p. 498.


Normalized standard scores are obtained by a process known as "area
transformation." In other words, a percentile score (representing a
cumulative percentage or area) is used as the basis for obtaining a standard
score. When area transformations are used, we are forcing our distribution
of test scores to fit the shape of a normal distribution. If the frequency
distribution of raw scores is approximately normal, normalized T-scores,
or "T-scaled scores,"14 will be similar to T-scores obtained by formula.
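
As a rough illustration of an area transformation, the sketch below converts
a small set of raw scores to T-scaled scores. It is not the chart procedure
used in the text: the cumulative proportion for each score is taken at the
midpoint of that score (all cases below plus half of the cases at the score),
the normal-curve lookup is done with Python's standard library, and the data
and function name are hypothetical.

```python
from collections import Counter
from statistics import NormalDist

def t_scaled_scores(raw_scores):
    """Map each distinct raw score to a normalized (area-based) T-score."""
    n = len(raw_scores)
    counts = Counter(raw_scores)
    below = 0
    table = {}
    for score in sorted(counts):
        freq = counts[score]
        # proportion of cases below the midpoint of this score
        proportion = (below + freq / 2) / n
        z = NormalDist().inv_cdf(proportion)
        table[score] = round(50 + 10 * z)
        below += freq
    return table

# Hypothetical scores piled up at the high end, as on a test that is too easy.
scores = [4, 6, 7, 8, 8, 9, 9, 9, 10, 10, 10, 10]
for raw, t in t_scaled_scores(scores).items():
    print(f"raw {raw:2d}   T-scaled {t}")
```

With a skewed set of scores like this one, equal raw-score differences no
longer correspond to equal differences in the converted scores, which is the
sense in which the distribution is forced toward the normal shape.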


14 Use of the term "T-scaled score" is recommended by Lyman to identify the
normalized T-score, based on area transformation, as distinguished from the
T-score based on linear transformation. Howard B. Lyman, Test Scores and What
They Mean (Englewood Cliffs, N. J.: Prentice-Hall, Inc., 1963), p. 115.




The following quotation from Anastasi indicates not only the condi- 
tions under which normalized standard scores may be used but also the 
desirability of adjusting the difficulty of test items so that the distribution 
of test scores more nearly approaches normality. 


Normal standard scores are standard scores expressed in terms of a distribu- 
tion that has been transformed to fit a normal curve. . . . Such a transformation 
should be carried out only when the sample is large and representative and 
when there is reason to believe that the deviation from normality results from 
defects in the test rather than from characteristics of the sample or from other 
factors affecting the behavior under consideration. . . . Whenever feasible, it is 
generally more desirable to obtain a normal distribution of raw scores by proper 
adjustment of the difficulty level of test items, rather than by subsequently 
normalizing a markedly non-normal distribution. [italics added]15


For example, if the arithmetic test were revised to include more difficult 
questions, the individual differences among students whose scores are now 
piled up at the high-score level would be more adequately measured. Then 
a more nearly symmetrical distribution of scores would be obtained, and 
the more efficient process of obtaining PR’s, T-scaled scores, and other 
normalized standard scores by means of Table A.1 or by means of the 
Otis Normal Percentile Chart could unquestionably be used. 

We can easily obtain PR's and the equivalent normalized standard
scores if we graph the cumulated percentages on an Otis Normal Percentile
Chart (Figure 2.6). In other words, a table of local norms for our
state history test can be easily obtained by graphing the cumulated
percentages for each interval and simply reading off the desired PR's (top
scale) or z-scores (bottom scale). A T-score scale or other types of
normalized standard score scales could easily be added to the graph. On
this kind of chart, intermediate values can easily be obtained, since a
normal distribution appears on such a chart as a straight line. The values
read from the graph for the state history test are shown in Table 2.6.

The graph was based on the data for all 500 students in the school
district, shown in Table A.3, rather than those for Central High School.
The reader will note by comparing Table A.3 with Table 2.2 that the
mean score for the school district is lower than that for Central High
School; and as one would expect, the SD for the school district is
substantially larger than that for a single school (which would naturally be
more homogeneous than "combined schools" with respect to achievement
on any test). Table A.3 also provides a computing guide for obtaining the
SD by the short method.


15 Anne Anastasi, Psychological Testing (New York: The Macmillan Company,
1954), p. 94.




STANINE SCORES For many purposes, a simpler type of one-digit standard
score, called the stanine score,16 is desirable. In the stanine scale, raw
scores are converted to a nine-point scale, with a mean of 5 and an SD of 2.
These stanines represent approximately equal steps on the raw-score scale
and can be used with any data that can be arranged in rank order. Stanine
scores obtained for one test are comparable17 with those for any other test
administered to the same group of students. Stanine scores can be averaged
and used in other types of mathematical computation.

Stanine scores can be assigned to groups of raw scores by using the 
Otis Normal Percentile Chart, or simply by assigning stanine scores 
sequentially to raw scores, which have been ranked or tallied. The follow- 
ing distribution is used: 


STANINE
SCORES     1        2        3        4        5        6         7         8         9
PERCENT             Next     Next     Next              Next      Next      Next
AT EACH    Lowest   lowest   lowest   lowest   Middle   highest   highest   highest   Highest
LEVEL      4%       7%       12%      17%      20%      17%       12%       7%        4%


These percentages represent the areas included in defined segments of the
normal curve, ½ SD in width, as shown in Figure 2.7. If a teacher is
assigning stanines to a distribution of test scores, he will not be able to
follow these percentages exactly, assigning a score of 9 to the highest
4 percent, a score of 8 to the next 7 percent, and the like. Since one must
obviously assign the same stanine score to all students receiving the same
raw score, only an approximation of these percentages is possible.
Diederich18 suggests that teachers, in using stanines for their day-by-day
equating of scores for recording in rollbooks, employ the following
distribution, simply because the multiples of 4 are easier to remember and
result in inconsequential differences in the converted scores. Note that only
slight changes in the four percentages underlined are involved.

STANINE
SCORES     1      2      3      4      5      6      7      8      9
PERCENT
AT EACH    4%     8%     12%    16%    20%    16%    12%    8%     4%
LEVEL             --            ---           ---           --

           RANGE (IN SD                         PERCENTAGE   RANGE OF
           UNITS FROM MEAN)        IQ RANGE     AT EACH      PERCENTILE
STANINE                                         LEVEL        RANKS
   9       +1.75 SD & above        128 & above      4        97 & above
   8       +1.25 SD to +1.74 SD    120-127          7        90-96
   7       + .75 SD to +1.24 SD    112-119         12        78-89
   6       + .25 SD to + .75 SD    104-111         17        61-77
   5       − .25 SD to + .25 SD     96-103         20        41-60
   4       − .75 SD to − .25 SD     88- 95         17        24-40
   3       −1.25 SD to − .74 SD     80- 87         12        12-23
   2       −1.75 SD to −1.24 SD     72- 79          7         5-11
   1       −1.75 SD and below      Below 72         4         4 & below

Fig. 2.7 Stanines for IQ's for a Group of Students with a Mean
of 100 and an SD of 16

16 The term "stanine" is used because it is a STAndard NINE-point scale from
9 (high) to 1 (low). One publisher to date (Harcourt, Brace & World) is using
stanine norms with several of its standardized test batteries and has prepared
a leaflet (free on request) on the use of stanines: Walter N. Durost, "The
Characteristics, Use, and Computation of Stanines," Test Service Notebook,
No. 23 (New York: Test Department, Harcourt, Brace & World, Inc., 1961). Many
applications of the stanine technique are also explained in Walter N. Durost
and George A. Prescott, Essentials of Measurement for Teachers (New York:
Harcourt, Brace & World, Inc., 1962).

17 When we say that scores are "comparable," we are not implying that they are
equivalent and can be substituted for each other. Two scores may be comparable
and yet reflect quite different abilities. Two scores are comparable if they
represent the same standing, or rank, in the same population.

18 Paul B. Diederich, Short-cut Statistics for Teacher-made Tests, Evaluation
and Advisory Service Series No. 5 (Princeton, N. J.: Educational Testing
Service, 1960), p. 37.


Stanines are often preferred in the interpretation of test data to students
and parents because of the ease with which they are understood. Moreover,
the coarseness of the unit reduces the chance of overgeneralization
on the basis of small differences in raw scores. Stanines also seem to be
ideally suited for weighting and summarizing data of a wide variety of
types, as in obtaining composite scores to be used in assigning marks, in
homogeneous grouping, or in the selection of students for special classes.
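
The assignment of stanines from percentile ranks can be sketched as a simple
lookup. The cut-off points below are the percentile-rank ranges listed in
Figure 2.7; the function name is an assumption made for the example, and with
real test data the percentages at each level will only approximate the
4-7-12-17-20-17-12-7-4 distribution, since tied raw scores must receive the
same stanine.

```python
# Highest percentile rank included in stanines 1 through 8 (from Figure 2.7);
# any percentile rank above the last value receives a stanine of 9.
UPPER_PR_LIMITS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine_from_pr(percentile_rank):
    """Return the stanine (1-9) for a percentile rank."""
    for stanine, upper in enumerate(UPPER_PR_LIMITS, start=1):
        if percentile_rank <= upper:
            return stanine
    return 9

for pr in (3, 11, 43, 62, 84, 99):
    print(f"PR {pr:2d} -> stanine {stanine_from_pr(pr)}")
```

A student at the 84th percentile, for example, falls in the 78-89 range and
so receives a stanine of 7.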

As we have discussed normalized standard scores and the ways of obtain- 
ing them, the reader has probably become aware of a convergence among 
the types of scores discussed. That is, linear standard scores are almost 
identical with normalized standard scores when the distribution of raw 
scores approximates the normal curve (as it should when large numbers 
of students are tested with a test of appropriate difficulty). The reader has 
also seen that there is a definite relationship between percentile ranks and 
normalized standard scores. Furthermore, T-scores and T-scaled scores
are seen as convenient transformations of their respective z-scores, used 
simply to eliminate the inconvenience of decimal points and negative num- 
bers. Stanine scores are single-digit normalized standard scores, which are 
especially convenient for certain purposes. 

The reader is now ready to study Table 2.7 and to examine Figure A.1 
and to note not only the relationships already discussed, but a variety of 
other standard scores that have been developed for specific purposes.19


Converted Scores Based on the Average Grade or Age Status of 
Examinees Obtaining the Same Score 


We will now consider the grade and age equivalents that are used so 
extensively with tests of achievement and scholastic aptitude. A converted 
grade or age score indicates the grade or age for which the student's test 
performance is typical. Age equivalents are stated in terms of the age 
(in years and months) for which test performance is typical, a mental age 
of 6-8 indicating a score on an intelligence test typical of a child 6 years 
and 8 months of age. A grade score or grade equivalent indicates the 
grade and months of attendance for which test performance is typical, a 
grade equivalent of 6.8 indicating a score typical of pupils who have 
attended sixth grade for 8 months. Here a decimal point is used, the year
being divided into ten months on the assumption that growth in achievement
takes place only during the ten months of the school year.

19 The reader is warned, however, not to assume that CEEB (College Entrance
Examination Board) scores are equivalent to the other scores shown, since the
average score of students applying for college entrance would be considerably
higher than that for persons in the general population; one could assume that
the variability or SD would be less than for the general population. Also the
AGCT score distribution would have proportionally fewer persons at the lower
levels of mental ability than the general population. Hence, an AGCT score of
60 undoubtedly represents a higher IQ than 70. With these exceptions, however,
the scores listed under each other in Figure A.1 are comparable.


Table 2.7
Major Types of Converted Scores for Tests Used by Educators and Psychologistsa

TYPE OF COMPARISON: 1) Comparison with arbitrary standard

  MAJOR TYPE OF CONVERTED SCORE: 1) Percentage correct
      X = (No. right × 100) / (No. of items)
  TYPICAL NORM GROUPS: None; comparison with predetermined standard (as
      represented by perfect score on specific test)
  ADVANTAGES: Not influenced by scores of other examinees. Easily understood.
      Provide basis for inferences concerning achievement on test content.
  DISADVANTAGES: Scores not comparable from test to test, since tests vary in
      difficulty. Provide no information on examinee's status with respect to
      others unless frequency distribution is examined.

TYPE OF COMPARISON: 2) Comparison with other examinees (in terms of
difference between examinee's score and the mean, in SD units or some
multiple thereof)b

  MAJOR TYPE OF CONVERTED SCORE: 2a) z-scores (M equated to 0, SD to 1)
      z = (Raw score − Mean) / Standard deviation = (X − M) / SD
  TYPICAL NORM GROUPS: The age, grade, or other group (such as pupils in a
      given grade or beginning file clerks) with which examinees can
      appropriately be compared
  ADVANTAGES: Make possible comparison of scores from test to test.
      Differences in z-scores are proportional to differences in raw scores.
      Can be used in computing averages, SD's, and correlations. Frequency
      distributions of z-scores have same shape as distribution of raw scores
      on which they are based.
  DISADVANTAGES: Decimal points and minus signs are required.

  MAJOR TYPE OF CONVERTED SCORE: 2b) T-scorese (M equated to 50, SD to 10)
      T = 10z + 50
  TYPICAL NORM GROUPS: Same as above
  ADVANTAGES: Advantages listed for z-scores, plus the elimination of minus
      signs and decimal points.
  DISADVANTAGES: Meaning less obvious than for z-scores. Size and range of
      values entail risk of confusion with PR's.

  MAJOR TYPE OF CONVERTED SCORE: 2c) Deviation IQ's (M equated to 100, SD to
      15 or 16; for example, Wechsler IQ = 15z + 100, Stanford-Binet
      IQ = 16z + 100)
  TYPICAL NORM GROUPS: Representative sampling of the general population at
      a specific age level
  ADVANTAGES: Advantages listed for z-scores and T-scores. Deviation IQ's
      have greater comparability from age to age than do ratio IQ's.
  DISADVANTAGES: Not directly comparable with standard scores obtained on
      achievement and other ability tests (that is, SD is 15 or 16, rather
      than 10, as for T-scores).

  MAJOR TYPE OF CONVERTED SCORE: 2d) Standard scores on College Entrance
      Examination Board tests (M equated to 500, SD to 100)
      CEEB score = 100z + 500
  TYPICAL NORM GROUPS: Students taking CEEB examinations in 1941
  ADVANTAGES: Advantages listed for z-scores, plus elimination of minus signs
      and decimal points. Small unit (0.01 SD) makes it possible for converted
      scores to reflect small differences in raw scores.
  DISADVANTAGES: Meaning less obvious than for z-scores. Small unit increases
      hazard of user's attributing significance to unreliable differences
      between scores.

TYPE OF COMPARISON: 3) Comparison with other examinees (in terms of rank of
examinee's score within group)

  MAJOR TYPE OF CONVERTED SCORE: 3a) Percentile ranks (PR's);c percentile
      bands
  TYPICAL NORM GROUPS: The age, grade, or other group with which examinees
      can appropriately be compared
  ADVANTAGES: Easily understood, even by persons untrained in measurement,
      such as students and parents. Easily computed. Widely applicable to a
      variety of tests and a variety of norm groups.
  DISADVANTAGES: Cannot be used in computing averages or SD's. Equal
      differences in PR's do not represent equal differences in the raw scores
      from which they were obtained. Risk of overemphasizing significance of
      differences between PR's near median and underemphasizing significance
      of differences between PR's near extremes.

  MAJOR TYPE OF CONVERTED SCORE: 3b) T-scaled scoresd (standard score
      equivalents of PR's in a normal distribution) (M equated to 50, SD
      to 10)
  TYPICAL NORM GROUPS: Same as above
  ADVANTAGES: Similar to T-scores, except that area transformations make
      frequency distribution of T-scaled scores more nearly normal than
      distributions of raw scores on which they are based.
  DISADVANTAGES: Hazard of being confused with T-scores, which use same
      number scale. Not suitable when frequency distributions are definitely
      skewed, with a pile-up of scores at either end of score range.

  MAJOR TYPE OF CONVERTED SCORE: 3c) Stanine scoresd (standard score
      equivalents of ranges in PR's in a normal distribution) (M equated to 5,
      SD to 2). Each stanine unit except the two extreme ones is equal to
      0.5 SD.
  TYPICAL NORM GROUPS: Same as above
  ADVANTAGES: Usable with any data that can be ranked. Easily understood,
      since interpretable in terms of ranges in PR's. All advantages of
      normalized standard scores, except that units are coarse. Easily
      averaged and used in other computations. Useful in interpreting test
      data to students and parents; risk of attaching too much significance to
      small differences in raw scores is minimized.
  DISADVANTAGES: Not widely used at present. Coarse unit, having only nine
      different values.

TYPE OF COMPARISON: 4) Comparison with other examinees (in terms of average
age or grade status of examinees obtaining the same raw score)

  MAJOR TYPE OF CONVERTED SCORE: 4a) Age scoresc (mental ages, educational
      age, and the like)
  TYPICAL NORM GROUPS: Successive age groups
  ADVANTAGES: Apparently easy to interpret, especially for children in the
      preschool and elementary school years. Useful in making inferences
      regarding child's readiness for certain learning experiences. May be
      compared meaningfully with chronological age as a basis for inferences
      regarding rate of development.
  DISADVANTAGES: Appropriate only for abilities showing continuous,
      relatively steady growth over several years. Apparent obviousness of
      meaning can lead to generalizing beyond test and norm group. SD values
      vary from test to test; scores that represent one year above or below
      CA in a series of subtests do not represent equal superiority (in terms
      of rank within norming sample).

  MAJOR TYPE OF CONVERTED SCORE: 4b) Grade placementc
  TYPICAL NORM GROUPS: Successive grade groups (grade placements may be based
      on full population, those of modal age, or those of modal age and
      intelligence)
  ADVANTAGES: Uses a unit that is familiar to teachers. Provides basis for
      inferences concerning the level at which groups are achieving on test
      content, as compared with representative norm samples.
  DISADVANTAGES: Disadvantages growing out of variations in SD indicated
      above. Risk of misinterpretation, in that student who shows a high
      degree of accuracy may be assumed to be able to do work effectively two
      or more grades above his grade placement.

a The classification used in columns 1 and 2 of this table is a simplified
modification of that developed by Lyman, who includes many other converted
scores in his comprehensive treatment of all such scores now in use. (Howard
B. Lyman, Test Scores and What They Mean, Englewood Cliffs, N. J.:
Prentice-Hall, Inc., 1963, pp. 90–199.)

b These derived scores are linear transformations of raw scores. A linear
transformation is one obtained when the same constant is added to all scores
and/or when all scores are multiplied by the same constant. The shape of the
frequency distribution will be the same whether one graphs the raw scores or
their linear transformations. When a constant (c) is added to each score, the
mean is increased by that amount (c), and the SD is unchanged. When each
score is multiplied by a constant (c), the new mean and SD are c times as
large as in the distribution of raw scores.

c A frequency distribution for these derived scores (PR's, age, and grade
scores) differs in shape from the frequency distribution of the raw scores
from which they are obtained. In fact, a frequency distribution of percentile
ranks is approximately rectangular, with the same or equal percentages of
scores in each interval. Moreover, equal differences in PR's (or age or grade
scores) do not represent equal differences in raw scores. In the
interpretation of PR's, one has to be especially cautious not to
overemphasize the significance of differences between PR's near the median
(or midscore), and not to underemphasize the significance of differences
between PR's near the extremes of the score range. PR's, age scores, and
grade scores cannot be averaged or used in other computations that assume
equality of unit.

d These derived scores are area transformations, since they are obtained by
finding standard scores that correspond to the cumulative percentages of
cases in a normal distribution; that is, a raw score that exceeds 84 percent
of the scores in the norming sample would be assigned a T-scaled score of 60,
or a stanine of 7, because of the proportion of the area under the frequency
polygon or curve that falls below the raw score. Unless the original
distribution of raw scores is approximately normal, the distribution of
normalized standard scores (such as T-scaled scores) will differ in shape
from the frequency distribution of raw scores.

e McCall, who developed the concept of the T-score, originally proposed that
a value of 50 be assigned to the mean on a test for a group of unselected
children twelve years of age. This specification, however, has been so
frequently ignored that it has fallen into disuse. See William A. McCall,
Measurement (New York: The Macmillan Company, 1939).


GRADE SCORES Although grade scores were not considered suitable
for the history tests, they could be used for the arithmetic test since
instruction in arithmetic is given throughout the grades. To say that an
eighth-grade student achieved as well in a comprehensive arithmetic test as
the average pupil completing the seventh or eighth grade would be meaningful.

Fig. 2.8 Hypothetical Norm Curve for Obtaining Grade Scores or
Grade Equivalentsa from Raw Scores on the Eighth-grade Arithmetic
Test. [Vertical axis: Raw Scores. Horizontal axis: Grade Equivalents, 4.0 to
11.0. The ends of the curve are marked "Extrapolated."b]

a Since the test was administered in the eighth month of the school year, the
average raw-score value for the sixth grade is plotted above 6.8, the seventh
grade above 7.8, and the like.

b Usually a norm curve for a standardized test involves extrapolation, or
extending the curve one or more years beyond the grades actually tested, using
the general trend of the curve as a guide. Ordinarily, extrapolation should be
limited to one grade below or one grade above the ones to which the test was
actually administered. For example, in this test, the scores below 60 could be
recorded as 5.0−; the ability of students scoring so low would be more
adequately measured by an easier test. It would be undesirable in this case to
record grade scores above 9.0 because such a small difference in raw score is
making a large difference in converted score. A more difficult test is needed
to differentiate among superior students.

Developing Grade Norms for the Local Arithmetic Test. To develop 
local grade norms for this test, one would administer it during the same 
month of the school year to all students (or a random sample) in the 
sixth, seventh, eighth, ninth, and tenth grades. That is, it would be best 
to administer the test to representative students at the grade level for which 
the test is designed (in this case, eighth grade) and also to representative 
students one grade and two grades below, as well as one and two grades 
above. If the students were all tested at the end of the year, the graph 
relating grade score (or grade equivalent) to average raw score might 
resemble Figure 2.8. From a curve such as this, a table of grade norms 
could be constructed. 
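
Reading grade equivalents from such a curve amounts to interpolating between
the average raw scores obtained at successive grade levels. The sketch below
uses straight-line interpolation between hypothetical norm points (the average
raw scores are invented for the example and are not read from Figure 2.8),
and it declines to extrapolate beyond the grades actually tested.

```python
def grade_equivalent(raw_score, norm_points):
    """Interpolate a grade equivalent from (grade placement, average raw score) pairs.

    Returns None when the raw score lies outside the range of the norm
    points, so that no extrapolation is attempted.
    """
    points = sorted(norm_points, key=lambda p: p[1])
    for (g_lo, s_lo), (g_hi, s_hi) in zip(points, points[1:]):
        if s_lo <= raw_score <= s_hi:
            fraction = (raw_score - s_lo) / (s_hi - s_lo)
            return round(g_lo + fraction * (g_hi - g_lo), 1)
    return None

# Hypothetical averages for grades 6 through 10, tested in the eighth month.
NORM_POINTS = [(6.8, 62.0), (7.8, 70.0), (8.8, 77.0), (9.8, 82.0), (10.8, 85.0)]

for raw in (50, 66, 70, 84):
    print(raw, grade_equivalent(raw, NORM_POINTS))
```

In this invented example the curve flattens at the upper grades, so a
difference of a few raw-score points there corresponds to a full year of
grade equivalent.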

We would not be justified in obtaining grade norms for a test for a 
single grade level unless its content coverage were broad, including a con- 
siderable number of test items suitable for the younger and older students. 
Examination of an arithmetic subtest of a nationally used achievement test 
battery, such as the Stanford Achievement Test or the California Achieve- 
ment Test, will illustrate such breadth of coverage. 

The reader will recognize a problem involved in interpreting grade
scores on this arithmetic test for students scoring at the ninth grade level
and above. The meaning of a grade placement of 10.0 or 12.0 in arithmetic
is difficult to interpret since no formal instruction in arithmetic is usually
given beyond the eighth grade. Such grade equivalents in the mechanics of
English would be meaningful because formal instruction in English
continues throughout the high school years.

In the interpretation of age and grade norms, we tend to assume that 
we have scaled our raw scores to an external dimension (age or grade) 
that represents equal intervals. This assumption is often far from valid; 
for example, a year of growth in arithmetic from 2.0 to 3.0, or from 9.0 
to 10.0, is far less than a year of growth from 6.0 to 7.0, because instruc- 
tion in arithmetic is emphasized in the upper elementary grades. On the 
whole, grade norms are most suitable for use at the elementary school 
level, with tests of abilities that show a relatively steady growth over the 
years of instruction. 

Grade Norms for a Published Achievement Test. A sample profile for
the California Achievement Test, showing the data for a hypothetical
student, is given in Figure 2.9. The grade equivalent for each of the subtests
is given (following the student's score) and is also shown graphically on
the profile. Since the student is in the eighth grade and was tested in March,
his actual grade placement20 (at the time of testing) is 8.6. That is, 8.6 is
the national grade norm for him and his class.

Fig. 2.9 Profile of California Achievement Test Results for a Hypothetical
Junior High Student. [Diagnostic profile sheet, printed sideways in the
original, with subtest scores plotted for Reading (1. Reading Vocabulary,
2. Reading Comprehension), Arithmetic (3. Arithmetic Reasoning, 4. Arithmetic
Fundamentals), and Language (5. Mechanics of English, 6. Spelling).]

Reproduced with the permission of the California Test Bureau from Ernest W.
Tiegs and Willis W. Clark, California Achievement Tests, Junior High School
Level (Monterey, Calif.: California Test Bureau).

It will be seen from the profile that this student is 7 months below his 
actual grade placement (or national grade norm) in total reading. He is 
6 months below his actual grade placement in reading vocabulary and 
8 months below grade norm in reading comprehension. He is 1.5 years 
above norm in arithmetic reasoning, and one year above in arithmetic 
fundamentals. His total language is one month below national grade norm. 
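
The arithmetic behind these comparisons can be sketched as follows. Grade
placements are kept in tenths of a school year to avoid decimal rounding. The
month table assumes a school with annual promotions, and the grade equivalents
in the profile are back-calculated from the differences stated above (for
example, 7 months below 8.6 gives 7.9); they are illustrative values, not
figures read from the profile sheet, and the function names are invented for
the example.

```python
# Tenths of a school year added for the month of testing (assumption: annual
# promotions, September through June counted as months 0 through 9).
MONTH_TENTHS = {
    "September": 0, "October": 1, "November": 2, "December": 3, "January": 4,
    "February": 5, "March": 6, "April": 7, "May": 8, "June": 9,
}

def actual_grade_placement(grade, month):
    """Actual grade placement in tenths of a year (grade 8 in March -> 86)."""
    return grade * 10 + MONTH_TENTHS[month]

def months_from_norm(grade_equivalent_tenths, placement_tenths):
    """School months above (+) or below (-) the actual grade placement."""
    return grade_equivalent_tenths - placement_tenths

placement = actual_grade_placement(8, "March")

# Illustrative grade equivalents, in tenths, implied by the text.
profile = {
    "Total reading": 79,
    "Reading vocabulary": 80,
    "Reading comprehension": 78,
    "Arithmetic reasoning": 101,
    "Arithmetic fundamentals": 96,
    "Total language": 85,
}

for subtest, ge in profile.items():
    diff = months_from_norm(ge, placement)
    print(f"{subtest}: {ge / 10:.1f} ({diff:+d} months)")
```

Here a difference of +15 months corresponds to 1.5 years, since the school
year is treated as ten months.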

Although the scores for minor subdivisions of each test (reading vocabu- 
lary in mathematics, science and the like) are plotted, one should not use 
these scores to read off grade equivalents at the right or left of the graph. 
Nor should one interpret small differences in these subscores as significant. 
As the reader will learn in the next chapter on reliability, scores on short 
tests are likely to fluctuate from one administration to another; and the 
interpretation of differences between scores is especially hazardous unless 
the scores compared are based on reasonably long tests of abilities that 
are not closely correlated with each other. 

Problems in Interpreting Grade Scores. Unless one keeps in mind dif- 
ferences in the spread of achievement in different subject areas, misin- 
terpretations can easily be made. Students at a certain grade level (for 
example, fifth grade) tend to be much more heterogeneous with respect to 
their achievement in reading and language usage than in such subjects as 
arithmetic, where the student's progress is more arbitrarily controlled by 
curricula and textbooks. 

It is evident from Table 2.8 that a class which is generally superior, 
in the sense that its average score in each subtest exceeds 75 percent of 
students in the norming population, would appear to be doing much
better in reading and language usage than in arithmetic. For a student or 


20 The actual grade placement for a child or a class group at time of testing is
determined by the grade in which the pupils are enrolled and the date on which
the test is given. A table for determining actual grade placements is usually given in
the test manual. The following is quoted from the manual of the California Achievement
Test:

Actual grade assignment is determined by adding to the pupil's grade the
following fractions of a year:

Months                    Low Section    High Section
September or February         .0              .5
October or March              .1              .6
November or April             .2              .7
December or May               .3              .8
January or June               .4              .9

Where schools have annual promotions only, ignore the Low Section and High
Section captions.


54 THE EVALUATIVE PROCESS 


class that is generally inferior, in the sense of exceeding only 25 percent
of the norming population on each subtest, the reverse would seem to be
true; that is, higher achievement in arithmetic than in reading and language
usage. Both situations are attributable to the greater homogeneity of
students in arithmetic achievement. Similar comparisons for two other
widely used achievement tests (the California Achievement Test and the
SRA Achievement Series) showed similar results, with much greater
heterogeneity in reading and language usage than in arithmetic.²¹


Table 2.8
Grade Placements (GP's) for a Beginning Sixth-Grade Student or Class
Achieving at a Generally Superior or Generally Inferior Level on the
Stanford Achievement Test

                GP's for student or class achieving at
Subtest         P75 LEVEL       P25 LEVEL
Reading         ET              54
Language        8.0             4.9
Arithmetic      7.0             5.4

Source: Warren G. Findley, “Use and Interpretation of Achievement Tests in
Relation to Validity,” 18th Yearbook, National Council on Measurement in
Education (Ames, Iowa: The Council, 1961), p. 32.


[...] not infer, of course, [...] as the grade scores imply if he [...] high
school level. He might, but we are certainly not justified in going that
far beyond the data. Moreover, his achievement in language and arithmetic
is equal, or at the same level, if the comparison is made in terms of
percentile scores, or the percentage of sixth-graders that he exceeds
(Table 2.8), or if comparisons are made in terms of normalized standard
scores, which are derived from percentile scores.

21 Warren G. Findley, “Use and Interpretation of Achievement Tests in Relation
to Validity,” Eighteenth Yearbook, National Council on Measurement in Education
(Ames, Iowa: The Council, 1961), pp. 32-34.

The hazards involved in interpreting grade and age norms are sufficiently 
great that the following recommendations are made in the “Technical
Recommendations for Achievement Tests." 


F 2 Where there is no compelling advantage to be obtained by reporting 
scores in some other form, the manual should report scores for defined groups 
in terms of percentile equivalents or normalized scores for defined groups. 
VERY DESIRABLE 

F 2.1 If grade norms are provided, tables for converting scores to percentiles 
(or standard scores) within each grade should also be provided. ESSENTIAL²²


Hence, it is evident that even though teachers feel at home with grade 
and age norms, it is best to supplement these norms by some type of con- 
verted scores that reflect the student's rank within his grade or some other 
appropriate reference group. Such supplementation is especially important 
for students or groups that deviate considerably from the average of their 
age or grade level in the abilities tested. 

For some kinds of decisions (such as deciding on the reading level of 
instructional materials for a student, or deciding whether he has made 
normal progress during a school year), we are interested in knowing what 
group the student most resembles: where he stands on the “ladder” of
mental growth or educational achievement. In these situations, it is meaningful
to use grade or age norms. In other cases, for example, when we
want to study a student's relative strengths for vocational guidance, we need
to compare the student with those of his own age or grade level, college
major, or vocation; in these cases, percentile ranks or standard scores
seem to serve the purpose better. For other decisions, such as grouping
students within a school or selecting students for a scholarship, any type
of converted scores, or even raw scores, are adequate for placing students
in rank order. Whenever the ranking of students should be based on a
composite of several scores, so that decisions can be made on the basis
of as much information as possible, the use of some type of standard score
(especially stanine scores) is advisable.


AGE SCORES Age norms are obviously useful during the preschool
years for various evidences of developmental maturity; for example, to say
that the child has as good balance or finger dexterity as a four-year-old is
quite meaningful. In fact, age norms for development in motor skills,
or any other aptitude highly correlated with age, are valuable.

22 Committee on Test Standards, American Educational Research Association
and National Council for Measurement in Education, Technical Recommendations
for Achievement Tests (Washington, D. C.: National Educational Association,
1955), p. 34.

The age equivalents of raw scores on an achievement battery are called 
educational ages, while those for a reading test would be called reading 
ages. Such age scores are infrequently used in interpreting achievement test 
data. In fact, the chief use of age scores has been on tests of scholastic 
aptitude.

Originally developed in connection with individual tests of mental ability, 
mental ages have served as a meaningful type of norm, especially as data 
have cumulated from research studies concerning the mental ages appar- 
ently required for the achievement of various types of educational and 
vocational tasks. As we shall see in Chapter 6, however, the assumption of 
equality in units of mental age holds much better during the elementary 
school years than it does during adolescence; and the concept of MA 
becomes meaningless as an attempt to interpret the level of scholastic
[...] the normal or superior adult.


Special Types of Converted Scores Designed to Serve Special Purposes 


[...] inferences concerning how far students and groups [...] an educational
or developmental “ladder.” Standard scores, despite their points of
superiority, are always based on comparisons within a defined age [...]

A few attempts have been made to develop [...] advantages of standard scores
but are designed [...] attempt. The K-scores for [...] performance of
tenth-grade students is equated to a [...] of measurement (that is, a score
unit of 1) is equated to 1/7 of the SD of the scores of fifth-grade pupils.
Differences between pairs of K-scores are [...]

STANDARD SCORES DESIGNED TO BE COMPARABLE FROM ONE HIGH
SCHOOL SUBJECT TO ANOTHER Although percentile scores and normalized
standard scores seem most feasible for use at the high school level,
they can also be misleading unless some adjustment is made for the
differences in scholastic aptitude among students who elect the various high
school subjects. If a student scores at the 80th percentile in a biology
test and the 40th percentile in a physics test, one would seem justified in 
concluding that the student showed a marked superiority in biology as 
compared to physics. However, almost all students take required courses 
in biology, while only above-average students elect physics and only the 
superior students survive to take an end-of-course examination. Hence, 
it becomes impossible to make an inference about this student's relative 
competency in these two fields unless the norms are based on comparable 
groups of students. 

At least two publishers have developed systems of scaled scores for a 
series of high school tests, designed to cope with this problem. In a system 
of converted scores used by the Cooperative Test Division, Educational 
Testing Service, 50 is defined as the score that would be made by a 
student of average ability who had had typical instruction in the course. 
In another system (utilized in the Evaluation and Adjustment Series of 
tests in high school subjects, published by Harcourt, Brace & World), the 
M and SD for the standard scores on each subject-matter test are set at 
the same level as the M and SD for intelligence quotients for the representative
group of subject enrollees on which the test was standardized.
For example, if students in the norming sample taking a specific
mathematics test had an average IQ of 109 and an SD of 12 IQ points, the
standard scores for that mathematics test would be established with an 
average of 109 and an SD of 12. This system also facilitates comparison 
between a student’s achievement and his capacity; for example, a student 
who made an average score of 109 in this mathematics test but had an IQ 
of 125 could readily be identified as working below ability level. 
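
The second system can be illustrated with a short sketch. It assumes that the conversion is a simple linear rescaling of raw scores to the IQ mean and SD of the norming sample; the raw scores, the function names, and the 12-point margin used to flag a discrepancy are hypothetical and are not taken from any publisher's manual.

```python
from statistics import mean, pstdev


def scaled_scores(raw_scores, target_mean, target_sd):
    """Rescale raw scores linearly so the group has the target mean and SD."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # SD of the norming sample itself
    return [target_mean + target_sd * (x - m) / sd for x in raw_scores]


def below_ability(standard_score, iq, margin=12):
    """Flag a score that trails the IQ by more than `margin` points (assumed rule)."""
    return iq - standard_score > margin


# Hypothetical raw scores for a small norming sample on a mathematics test.
raw = [22, 30, 34, 38, 46]

# Mean IQ 109 and SD 12 are the figures used in the example in the text.
scaled = scaled_scores(raw, target_mean=109, target_sd=12)

# The student in the text: standard score 109, IQ 125.
flag = below_ability(standard_score=109, iq=125)
```

Because both numbers are on the same scale, the comparison reduces to a subtraction; how large a gap deserves attention is a judgment the sketch leaves to the `margin` argument.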


DEFINING NORMING POPULATIONS AND 
SELECTING NORM SAMPLES 


Norms for the Population-in-General vs Norms for Homogeneous Groups 

In interpreting test data on average school achievement in the fundamental
skills or in different content areas, general-population norms are
needed as a basis for comparison. In counseling a student about educational
and vocational plans, however, general-population norms may be of
limited value. Norms are needed that are relevant to the inference one
wishes to make, for example, norms for "freshmen accepted by engineering
schools" for a student aspiring to that vocation.

Whether general-population norms or norms for selected homogeneous
groups are provided, it is important that the test authors define the
populations to be sampled and then use effective procedures to see that the
norming samples represent those populations.

A test author, for example, might develop an entrance examination for
private liberal arts colleges. In such a case, his norming sample need not
be representative of students in general, or even of college freshmen. If he
defines his population as "applicants for freshman standing in liberal arts
colleges," he could proceed to sample that defined population.

In standardizing vocational aptitude tests, several populations may be 
sampled to increase the number of valid inferences that can be made from 
test scores. A test of clerical aptitude, for example, should be normed on 
several relatively homogeneous groups of people with whom it is sensible
to compare the student's performance. For example, a girl taking a secre- 
tarial course in a commercial high school would like to know how her score 
of 152 on the General Clerical Test compares with the scores of those with 
whom she will be competing for employment. Examination of the norms 


for this test reveals that her score of 152 can be compared with two relevant
groups, as follows:

PERCENTILE
RANK        NORMING SAMPLE               INTERPRETATION
PR 85       Commercial high school       Her score exceeds that of 85 percent
            senior girls                 of commercial high school senior girls
                                         in general
PR 75       Commercial high school       Her score exceeds 75 percent of
            senior girls completing      commercial high school senior girls
            secretarial training         completing secretarial training


In addition, the local school district has obtained information on the test
scores of applicants to two large local firms that employ their graduates.
In comparison with firm X’s clerical applicants, Sue's PR is 60, while in
comparison with their secretarial applicants, her PR is only 32. The other
large firm in town, firm Y, has not compiled its own norms, but has merely
established critical scores²³ on this and other tests, below which applicants
will not be considered. For clerical applicants, firm Y has set a critical
score of 130; for stenographic applicants, its critical or cut-off score is
150.²⁴

23 A critical score or a cut-off score is a score below which applicants are
rejected as being too low on some critical qualification. For example, the police
force will have critical scores with respect to height and vision; people scoring
below these critical or cut-off scores are not considered further in the selection
process.

24 These case data are excerpted from “Norms Must Be Relevant,” Test Service
Bulletin No. 39 (New York: The Psychological Corporation, 1950).


From all these data, it is apparent that although this student is quite 
superior to her fellow-students (exceeding 85 percent), she excels only 
three-fourths of the secretarial graduates. According to the norms and cut- 
off scores of local firms, she would be considered above-average for clerical 
applicants but marginal as an applicant for stenographic work, insofar as 
the score on this test is concerned. Obviously factors other than clerical 
proficiency are also considered in the employment and retaining of applicants;
and her score is sufficiently high that these other factors will probably
be taken into account.
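
The reasoning in this case can be sketched in a few lines. The sketch assumes the simple definition of a percentile rank as the percentage of the norm group scoring below the student; the two lists of norm-group scores are hypothetical stand-ins for published or local norms, while the cut-off scores of 130 and 150 are those quoted above for firm Y.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score` (ties are not counted)."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Hypothetical norm groups; real norms would come from the test manual
# or from local records.
norm_groups = {
    "senior girls in general": [100, 110, 118, 125, 131, 137, 142, 148, 155, 163],
    "secretarial graduates": [120, 130, 138, 144, 147, 150, 151, 158, 165, 172],
}

# Cut-off scores quoted in the text for firm Y.
cutoffs = {"clerical": 130, "stenographic": 150}


def interpret(score):
    """Return the student's PR in each norm group and whether each cut-off is met."""
    ranks = {group: percentile_rank(score, scores)
             for group, scores in norm_groups.items()}
    passes = {job: score >= cut for job, cut in cutoffs.items()}
    return ranks, passes


ranks, passes = interpret(152)
```

The same raw score yields a different percentile rank in each group, which is the point of the case: the score has no single meaning apart from the reference group chosen.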

Figure 2.10 indicates clearly the importance of using such special norms 
to aid in interpreting tests used in vocational guidance or tests adminis- 
tered as aids in the selection and assignment of employees. Although scores 
on the number-checking section of the Minnesota Clerical Test are nor- 
mally distributed for workers in general, with relatively few obtaining A 


[Figure 2.10: bar chart of percentages (10% to 100%) obtaining letter grades E through A among Workers in General, C through A among Routine Clerks, and B and A among Accountants-Bookkeepers.]


Fig. 2.10 Occupational Differences on the Minnesota Clerical 
(Numbers) Test Showing the Percentage of Each Type of Worker 
Making a Given Letter Grade. 

From D. M. Andrew and D. G. Paterson, "Measured Characteristics 
of Clerical Workers," University of Minnesota Bulletin of the Employ- 
ment Stabilization Research Institute, Vol. 1, 1934. 




and B scores, almost all clerks have A and B scores and almost all
accountant-bookkeepers have A scores. In fact, an 18-year-old high school
senior whose raw score of 106 would give him a PR of 58 for his age
group and a PR of 74 for "all employed adults" would obtain a PR of
only 1 for accountant-bookkeepers.²⁵


Procedures Used in Obtaining Norming Samples Representative of 
Norming Populations 


One of the questions frequently asked in measurement classes concerns 
the number of cases needed for an adequate norms table. Actually one can 
test many thousands of cases that happen to be available and still not have 
a representative sample of the defined norming population. Bias (usually 
in the direction of superior achievement) is often involved when the pub- 
lisher uses those test data that test users send in voluntarily or when he 
tests only those intact classes that school principals suggest as good classes 
to test. 

With respect to number of cases, the sample should be sufficiently large 
that the norm curve drawn would not differ significantly from one that 
would be drawn on the basis of another sample of the same size. Prior to 
our discussion of reliability in the next chapter, all we can say is that the 
size of norming samples and subsamples should be sufficient to provide 
stable values; that is, the converted scores for specific raw scores should
not fluctuate significantly if the test were renormed on another sample of
the same size.
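
The idea of stability can be made concrete with a small simulation. The sketch below is illustrative only: the population of raw scores is artificial (roughly bell-shaped, generated with a fixed seed), and the function `pr_spread` is a hypothetical helper that reports how far the percentile rank of one raw score wanders across repeated norming samples of a given size.

```python
import random


def percentile_rank(score, sample):
    """Percent of the sample scoring below `score`."""
    return 100.0 * sum(1 for s in sample if s < score) / len(sample)


def pr_spread(population, sample_size, score, trials=30, seed=0):
    """Range of percentile ranks for `score` across repeated samples of one size."""
    rng = random.Random(seed)
    ranks = [
        percentile_rank(score, rng.sample(population, sample_size))
        for _ in range(trials)
    ]
    return max(ranks) - min(ranks)


# Artificial population of raw scores (an assumption, not real test data).
_rng = random.Random(42)
population = [round(_rng.gauss(60, 12)) for _ in range(20000)]

small = pr_spread(population, sample_size=20, score=70)
large = pr_spread(population, sample_size=2000, score=70)
```

With samples of 20 the converted score for the same raw score shifts widely from one sample to the next; with samples of 2000 the shift is far smaller. The simulation says nothing about bias, which no increase in sample size will cure.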


When a published test of achievement or aptitude is standardized to obtain
national norms, the tests must be administered to norming samples
that are representative of students at each age and grade level (for which
the test is designed) in the country as a whole. The problem of obtaining
norming samples that are representative of age or grade groups within the
population-in-general has always been a very difficult one. School children
of a specific age level, for example, age 11, are in different grades. The
difficulties of locating representative age groups of preschool-age children
are even greater.

The author of a high school test will have great difficulty in obtaining
a representative sampling of 17-year-olds. He must either make extraordinary
efforts to locate drop-outs, as well as accelerated students already in
college, in order to obtain a representative sampling of all 17-year-olds;
or he can choose to define his norming population as “students attending high
school,” and then proceed to sample that population.

The number and representativeness of the communities included in the
norming population tested are especially important in norming an achievement
test; a very large number of cases from only a few communities may
result in distorted standards because of differences in curricular emphasis
and achievement in various parts of the country.

25 Donald E. Super and John O. Crites, Appraising Vocational Fitness (New
York: Harper & Row, Publishers, Inc., 1962), p. 166.

In the standardization of the Stanford Achievement Test, every student 
from certain grade levels in 340 communities was tested; then the norms 
were developed on the basis of a much smaller random sampling of these 
students. The 340 communities included 104 from the New England and 
Middle Atlantic states, 59 from the North Central states, 58 from the 
Southern states, and 119 from the Pacific Coast states. One hundred of 
the communities included were small towns under 2500 population; one 
hundred ranged from 2500 to 24,999; 104 were county, district, or union 
school districts; and 36 were communities of 25,000 or more. This norm- 
ing procedure is an example of superior practice in obtaining representa- 
tive samples. 

A growing body of information has been developed on characteristics 
of school systems that are related to test performance. The United States 
Bureau of the Census maintains a card file of approximately 70,000 school 
systems, with sufficient information concerning them, that one can draw 
a sample of school systems that is reasonably representative with respect to 
factors related to school achievement or scholastic aptitude. The following 
statement concerning the standardization of the Henmon-Nelson Tests of
Mental Ability illustrates good procedures:

A multi-stage sample of school systems was obtained, yielding 250 systems
stratified by size of system (with 8 size classifications) and by geographical
region. School directories for the 48 states were consulted so that an elementary
school within each of the 250 systems could be drawn with probability
proportionate to school size.²⁶
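
Drawing a school "with probability proportionate to school size" can be sketched as follows. The system names, school names, and enrollments are hypothetical, and the sketch covers only this last stage of the procedure, not the stratification of systems by size and region.

```python
import random


def draw_school(schools, rng):
    """Pick one school with probability proportionate to its enrollment."""
    names = list(schools)
    sizes = [schools[name] for name in names]
    return rng.choices(names, weights=sizes, k=1)[0]


def draw_sample(systems, seed=0):
    """Draw one school from each sampled system."""
    rng = random.Random(seed)
    return {system: draw_school(schools, rng)
            for system, schools in systems.items()}


# Hypothetical school systems with elementary-school enrollments.
systems = {
    "System A": {"Adams": 900, "Baker": 100},
    "System B": {"Clark": 250, "Dewey": 250, "Eliot": 500},
}

sample = draw_sample(systems)
```

Under this scheme a school enrolling 900 pupils is nine times as likely to be drawn as one enrolling 100, so that each pupil in the system has about the same chance of falling in the sample.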


The work of obtaining national norms will be greatly simplified when we 
have available the results for “anchor tests” administered to truly repre- 
sentative samples of the population-in-general. Through Project Talent and 
the work of the American Textbook Publishers Institute, anchor tests are 
being developed, which can be administered along with new tests to norm- 
ing samples and thus assist in the development of converted scores that 
are more nearly comparable from test to test.²⁷ Project Talent has involved
the testing of a random sampling of 5 percent of all students in grades
nine through twelve. Calibrating new tests against these anchor tests may
achieve new levels of comparability in converted test scores.

26 Tom A. Lamke, “The Standardization of the Henmon-Nelson Revision,” The
Thirteenth Yearbook of the National Council on Measurements Used in Education
(Ames, Iowa: The Council, 1956), p. 43.

27 Lee J. Cronbach, Essentials of Psychological Testing, 2d ed. (New York:
Harper & Row, Publishers, Inc., 1960), p. 93; Roger T. Lennon, "Discussion of
the School Administrator's Problems," Invitational Conference on Testing Problems
(Princeton, N. J.: Educational Testing Service, 1957), p. 98.

The publishers of the Stanford Achievement Test and the Metropolitan
Achievement Test have introduced a significant refinement into the development
of grade norms. They have based their “modal-age grade norms”
on the scores of all students in a given grade who are typical with respect
to age. By means of these norms, the achievement of a student can be
compared with that of students who make normal progress in school. The
modal-age group tends to be slightly higher in ability than an unselected
group, in that larger numbers of dull overage than bright underage students
have been excluded. Hence, the publishers contend that modal-age grade
norms provide a better standard of accomplishment for students than
norms based on the total grade population.

National Norms as Standards 


One of the disconcerting phenomena in recent years is the extent to
which national norms have been used as standards of accomplishment. The
national norms for a test are a set of organized data that show the test
scores earned by norming samples representative of defined populations.
These norms should aid the user in making sensible inferences regarding
(1) the present performance of individuals and groups on each test and
(2) their relative level of performance in different subjects or characteristics.


Norms for such groups are of considerable assistance [...] of student
achievement of individuals and groups [...] sample with the same age, grade
placement, and mental age. These AAGP scores [...] expectancy scores, which
indicate the average [...] comparable to the examinee with respect to [...]
mental age (MA) and actual grade placement.

A quite different problem arises from the fact that many schools have
become complacent because their students are up-to-norm on all skills.
There should be little reason for satisfaction about average achievement by
classes with superior scholastic aptitude. For that matter, there is little basis
for gratification when classes with average intelligence achieve at national 
norm when one considers the many factors (in the student and the learn- 
ing situation) that tend to keep students in the nation as a whole from 
achieving at an optimum level. 

Dressel has urged that norms be based on student achievement under 
optimal conditions so that schools could set for themselves standards that 
are not based on mediocre accomplishment. Such norms, of course, would 
have to be supplementary to the types that are now available.²⁸ These
norms would be especially appropriate, as Dressel implies, in areas in 
which schools give lip-service to new objectives but do not proceed to im- 
plement the objectives effectively. 


USE OF DIFFERENT TYPES OF CONVERTED SCORES 
IN COMPUTATION 


Throughout this chapter, various statements have been made concerning 
computation with converted scores; for example, the reader was warned 
that he should not average percentile ranks. It might be well at this point 
to consider the four different types of number systems. 


1. Nominal numbers are used as symbols for categories. Illustrative of nominal 
numbers are area telephone codes or code numbers assigned to designate 
different areas of interest in an interest inventory. For such numbers, the 
terms "greater than" and "less than" have no meaning. These numbers are
countable, but cannot be ordered or ranked. However, frequency distribu- 
tions by categories can be prepared and analyzed by suitable statistical tech- 
niques. 

2. Ordinal scales involve numbers that can be counted and ranked but should 
not be used in computing sums, differences, or ratios; in an ordinal number 
system, differences of equal size do not have comparable meaning at dif- 
ferent points in the scale. If there are 20 students in a class, each student 
can be assigned a rank with respect to any defined characteristic; or students'
essays or art products can be ranked from 1 to 20 with respect to their general
quality or some aspect thereof. However, one cannot infer that the
difference in quality between ranks 1 and 2 is the same as the difference
between ranks 11 and 12. An ordinal scale reflects position in an ordered 
series, but the scale does not have equality of units. 

Ranks and PR’s are ordinal numbers. Because of the lack of comparability 
of units at different points in the scale, age and grade equivalents should also 
be considered as ordinal numbers. For these types of converted scores, we 
should use only formulas that make no assumptions about equality of units. 


28 Paul L. Dressel, “The Way of Judgment,” The Fifteenth Yearbook of the
National Council on Measurements Used in Education (New York: The Council,
1958), pp. 5-8.


For example, one can legitimately compute the median rank or PR, but one
should not compute a mean, for its computation assumes equality of units.
The quartile deviation²⁹ should be used as a measure of variability, rather
than the standard deviation, which assumes equality of units. Measures of
relationship must be obtained by special means suitable for ordinal numbers
(Chapter 3).

3. Interval scales are based on equal or comparable measurement units throughout
the scale; that is, an interval of 1, 2, or more points has the same meaning
throughout the scale. Interval scores can not only be counted and ranked
but used to compute sums of scores and differences between scores. One
can make inferences on the basis of these sums and differences. Interval 
scales, however, have no meaningful zero point. For example, a pupil mak- 
ing zero on a very difficult spelling test does not have zero spelling ability; 
he might make a fairly high score on an easy test. In fact, we devise many 
tests in such a way that the easy items are omitted to reduce test length; 
we do not test junior high school students on simple spelling words that 
almost all students have learned during their early school years. 

Nunnally illustrates an interval scale by the example of obtaining data on 
relative running times in a race (as one second behind the winner, two 
seconds, three seconds, and the like).³⁰ We could average these data and
perform any kind of computations with them that were based on differences
between scores. However, we could not say, on the basis of these scores,
that one runner was twice as fast as another. Our thermometers have an
interval scale with equal units but no meaningful zero point. In fact, 0° on
the Centigrade scale represents 32° on the Fahrenheit scale; moreover, 0°
on neither scale represents the lowest possible temperature. Hence we cannot
say that 60° on either scale is twice as warm as 30°.


If we consider raw scores on tests to represent “[...] answered correctly,”³¹
we can treat these scores as interval [...] Scores can therefore be used in
the computation [...] of relationship, to be explained in Chapter [...]
scores to compute ratios; that is, we cannot say that a test score of 60 is
twice as good as one of 30.

29 The quartile deviation (Q) is equal to one-half the range between Q1 (or P25)
and Q3 (or P75). The formula is: Q = (Q3 - Q1) / 2.

30 J. C. Nunnally, Jr., Tests and Measurements: Assessment and Prediction (New
York: McGraw-Hill Book Company, Inc., 1959), p. [...]

31 When we consider test scores as [...] difference in [...] difference between
scores of 95 and 100. The students with [...] not have been able to show [...]
However, raw scores and their linear transformations [...] sense of “number
right” and hence are interval [...]

4. Ratio scales have not only equal measurement units but also a meaningful
zero point. Ratio numbers are countable and rankable; they can not only be
used in computations that assume equal differences between successive numbers,
but ratios between such numbers can be computed. That is, when we
measure running time in seconds, we can say that one runner is twice as
fast as another. Most scales in physical measurement are ratio scales. Hence
we justifiably use them to compute ratios; we can say that one man is 1.25
times as tall as another; that one man has walked twice as far as another.

In testing, we obtain ratio scores only when we test a random sampling 
of a precisely defined universe of items. For example, when we develop a 
test that is a random sampling of all spelling words studied, we can say that 
Tom, who has a score twice as high as Sue, can spell twice as many of this 
universe of spelling words. We could make similar inferences about tests 
involving random samplings of the addition, subtraction, multiplication, or 
division facts. 


Of course, the errors of measurement³² (which reflect variations in
sampling of words and temporal fluctuations in student performance) must
be taken into account in judging the confidence with which such a statement
can be made. However, this type of score represents the closest approach
(in measurement) to ratio scores. Note that in interpreting student
performance on a sample of spelling words, we do not generalize beyond
the population of words sampled, in this case the words in the state speller
for that grade level. In such a situation, a meaningful zero point exists.
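
Returning to the ordinal scales of item 2, the two statistics recommended there, the median and the quartile deviation Q = (Q3 - Q1) / 2, can be computed as in the sketch below. The percentile ranks are hypothetical, and the quartiles are located with the default ("exclusive") method of Python's `statistics.quantiles`; other quartile conventions would give slightly different values for small groups.

```python
from statistics import median, quantiles


def quartile_deviation(values):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" method
    return (q3 - q1) / 2


# Hypothetical percentile ranks for seven students.
prs = [10, 20, 30, 40, 50, 60, 70]

center = median(prs)              # middle score; no equal-unit assumption
spread = quartile_deviation(prs)  # half the interquartile range
```

Both statistics depend only on the order of the scores and on the values found at particular positions, which is why they are preferred to the mean and the standard deviation for ranks, PR's, and age or grade equivalents.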

In some measurement textbooks, the statement is made that raw scores 
are meaningless; one can see that in a situation such as the spelling test 
described, raw scores can serve as the basis of meaningful inferences. 
Moreover, in many situations, where the user is concerned only with the 
student's rank within a group (such as in assigning marks or in selecting 
students for some award), raw scores are just as usable as converted 


Scores. 
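
The inference described for the spelling test can be worked through numerically. In this sketch the size of the word universe (2000 words), the length of the test (50 words), and the two scores are hypothetical; the estimate assumes that the test words are a random sample of the universe and ignores errors of measurement.

```python
def estimated_words_known(score, items_in_test, universe_size):
    """Estimate how many words of the universe a student can spell."""
    return universe_size * score / items_in_test


UNIVERSE = 2000  # hypothetical number of words in the speller
ITEMS = 50       # hypothetical number of words randomly drawn for the test

tom = estimated_words_known(40, ITEMS, UNIVERSE)
sue = estimated_words_known(20, ITEMS, UNIVERSE)
ratio = tom / sue
```

A raw score of zero here corresponds to an estimate of no words known in the universe sampled, so the zero point is meaningful and the ratio of the two estimates equals the ratio of the two raw scores.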


SUMMARY STATEMENT 


Before a student's scores on different tests can be interpreted as indicating
relative strengths and weaknesses, they must be translated into converted scores,
which indicate how his test performance compares with those of others in some
reference group with which he can be appropriately compared.

Three major approaches to obtaining converted scores, on the basis of
comparisons with appropriate reference groups, were studied:

1. Standard scores (z-scores, T-scores, and the like, based on the difference
   of a student’s score from the group average, expressed in SD
   units or some multiple thereof)
2. Percentile scores, normalized standard scores, or stanine scores (based
   on the relative position, or rank, of the student's score within the group
   of all students tested, or some defined reference group)
3. Age or grade scores (the average age or grade status of students
   obtaining the same score)

Each of these approaches was illustrated with local data compiled to answer
questions with respect to student achievement in spelling, arithmetic, and
history.

The advantages and disadvantages of each type of norms were summarized
in Table 2.7. In this table, the formula or the procedures for computation are
given for each type of norm, as well as the typical reference groups used as a
basis of comparison.

32 Measurement errors are discussed in the next chapter.
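
The first two approaches summarized above can be sketched in a few lines; age and grade scores are omitted because they require norm tables for several age or grade groups. The group of raw scores is hypothetical, the percentile rank is taken as the percentage of the group scoring below the student, and stanines are assigned from the usual percentage bands (4, 7, 12, 17, 20, 17, 12, 7, 4), with the treatment of scores falling exactly on a band boundary being an arbitrary choice of this sketch.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Upper percentile-rank bounds of stanines 1 through 8.
STANINE_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def z_score(x, group):
    """Difference from the group mean in SD units."""
    return (x - mean(group)) / pstdev(group)


def t_score(x, group):
    """Standard score with mean 50 and SD 10."""
    return 50 + 10 * z_score(x, group)


def percentile_rank(x, group):
    """Percent of the group scoring below x."""
    return 100.0 * sum(1 for s in group if s < x) / len(group)


def stanine(pr):
    """Stanine (1 to 9) corresponding to a percentile rank."""
    return bisect_right(STANINE_BOUNDS, pr) + 1


# Hypothetical raw scores for a reference group of five students.
group = [48, 56, 60, 64, 72]

z = z_score(72, group)
t = t_score(72, group)
pr = percentile_rank(72, group)
st = stanine(pr)
```

The z-score and T-score depend on the distance of the score from the group mean, whereas the percentile rank and stanine depend only on the student's position within the group.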


SELECTED REFERENCES 


EBEL, ROBERT L., "Content Standard Test Scores," Educational and Psychologi- 
cal Measurement, vol. 22 (Spring 1962), pp. 15-25. 


ENGELHART, MAX D., "Obtaining Comparable Scores on Two or More Tests,"
Educational and Psychological Measurement, vol. 19 (Spring 1959), pp.
55-64.

FRANZBLAU, A. M., A Primer of Statistics for Non-Statisticians. New York:
Harcourt, Brace & World, Inc., 1958.

GARDNER, ERIC F., "Value of Norms Based on a New Type of Scale Unit,"
Proceedings, 1948 Invitational Conference on Testing Problems. Princeton,
N.J.: Educational Testing Service, 1949, pp. 67-74.

LYMAN, HOWARD B., Test Scores and What They Mean. Englewood Cliffs, N.J.:
Prentice-Hall, Inc., 1963.

NEDELSKY, LEO, “Absolute Grading Standards for Objective Tests,” Educational
and Psychological Measurement, vol. 14 (Spring 1954), pp. 3-19.

SEASHORE, HAROLD G., “Methods of Expressing Test Scores,” Test Service Bulletin,
No. 48. New York: The Psychological Corporation, 1955. Available
on request.

SEASHORE, HAROLD G., AND JAMES H. RICKS, JR., “Norms Must Be Relevant,” Test
Service Bulletin, No. 39. New York: The Psychological Corporation, 1950.
Available on request.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. How should national norms on achievement tests be used? Evaluate the
following uses of national grade placement norms:
a. […] achievement is judged.
b. […] attained by each individual.
c. […] passed by the upper half of the class.
d. […] in relation to the average accomplishment of pupils in […]
intelligence.
2. List three typical situations in which achievement test norms can be used
to advantage in interpreting test results.




3. Why are age and grade scores unsuitable for tests in most high school
subjects? What are the relative advantages of standard scores and percentile scores?
4. Using the mean and SD given in Table 2.3, convert the following raw
scores into z-scores:
X = 95, X = 75, X = 65.
5. Using Table 2.6, find the percentile ranks, T-scores, and stanine scores for 
the following raw scores: 
X = 55, X = 65, X = 75, X = 85. 
6. For each of the following sets of test scores for the sixth-graders of a 
school system, select a suitable size of interval, and set up a form for tallying 
the scores: 


TEST                  RANGE OF SCORES
Spelling              9-49
Vocabulary            18-98
Interest inventory    70-240


7. Tally the reading grade placements in Table 14.1. Compute P50 (the
median score), P[…], and P[…].
8. Would national norms be necessary for interpreting:
a. A test of ability to understand oral Spanish which is given as the chief
basis of admitting students to a class in Spanish conversation?
b. An intelligence test given to children being placed by an adoption
service?
c. A diagnostic reading test given to determine whether students need
remedial instruction?
d. An aptitude test being used in the vocational guidance of high school
students?


3 Reliability 


When a person takes a test, we obtain a limited sampling of his perform- 
ance in the area tested. Two different forms of a test, even though they are 
designed to be equivalent, provide somewhat different samples of behavior. 
Moreover, individuals vary from one test session to another in their level 
of motivation, speed of working, and other characteristics. 

The concept of reliability has to do with consistency of measurement,
the extent to which an individual's scores vary from one sampling to
another of the same type of behavior. When we are making decisions
concerning individuals, we need to obtain many samples of behavior. As
Diederich says,1

There is very little hope of proving anything in education with single
measures. The real hope lies in repeated measurements: either testing many students
with each single measure, or testing the same student with many different
measures. Hence, we like to have repeated measurements of scholastic aptitude,
reading achievement, and other important variables if we are to draw inferences
concerning individuals.


Variations in an individual's test behavior on different samplings of items
and different testing occasions sometimes result in overestimates, and
sometimes in underestimates, of the examinee's ability. These errors are
compensating errors; they tend to average out in repeated testings. Other types
of errors (such as a student's habit of cheating on tests, or his characteristic
carelessness in reading test directions) constitute systematic errors, which
do not average out but tend to systematically raise or lower an individual's
scores. In this chapter, we are concerned chiefly with the types of
compensating errors involved in obtaining limited samplings of behavior.
Systematic errors are discussed more fully in Chapter 4 on validity.

1 Paul B. Diederich, "Short-cut Statistics for Teacher-Made Tests," Evaluation
and Advisory Service Series No. 5. (Princeton, N. J.: Educational Testing Service,
1960), p. 20.


INTERPRETING TEST SCORES IN TERMS OF 
SOURCES OF VARIANCE 


Individual differences with respect to test scores arise from many sources 
(Table 3.1). Our aim in test construction and administration is to have 
most of the variation in test scores attributable to individual differences in 
the ability or trait we wish to measure. We recognize, however, that much 
of the variation in test scores is due to other factors, for example, indi- 
vidual differences in speed of working, "testwiseness," and other factors 
that affect a person's scores on many tests. 

Many of the converted scores discussed in the preceding chapter are 
based on a measure of dispersion or variability of scores among individ- 
uals, that is, the standard deviation. Although an approximation formula 
for the SD was used in Chapter 2, we should now learn the basic formula.2


$$SD = \sqrt{\frac{\sum x^2}{N}}, \quad \text{where } x = X - M$$


The standard deviation can be computed by finding all x values (or
deviations from the mean) and substituting them in the formula. Or one can
use the short method with grouped data, as shown in Table A-3 in the
Appendix.
The expression under the radical sign, $\sum x^2 / N$, is called the "variance" (V).
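The deviation formula can be sketched in a few lines of Python. This is only an illustration: the function names and the eight scores are hypothetical, and the divisor is N, as in the basic formula for a specific sample.

```python
import math

def variance(scores):
    """Mean of the squared deviations from the mean (divisor N)."""
    n = len(scores)
    mean = sum(scores) / n
    # x = X - M for each score, then square and average
    return sum((x - mean) ** 2 for x in scores) / n

def standard_deviation(scores):
    """Square root of the variance."""
    return math.sqrt(variance(scores))

# Hypothetical raw scores, chosen only so the arithmetic is easy to follow
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(scores))            # 4.0
print(standard_deviation(scores))  # 2.0
```

To estimate the variability of a population from a sample, the divisor would be N - 1 rather than N.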


The term "variance" is an expression for the amount of scatter around the
mean (or the mean of the squared deviation scores). The relationship
between the variance and the standard deviation is as follows:

$$V = SD^2$$

$$SD = \sqrt{V}$$

2 If we want to estimate the variability of the population that a sample
represents, the formula should read $\sigma = \sqrt{\frac{\sum x^2}{N - 1}}$.
Conventional practice calls for the use of letters, such as M, SD (or s), for
values obtained for specific samples; while corresponding Greek letters
($\mu$, $\sigma$) are used for estimates of the mean and SD for the population
from which a random sample has been taken. Most textbooks use Greek letters in
the presentation of formulas because it is assumed that the investigator is
interested, not in the SD for his sample, but in $\sigma$ (an estimate of the
standard deviation for the population which the sample represents). In many
studies in education, however, the researcher is not interested in generalizing
beyond the sample tested to the population that the sample represents. Hence,
we have avoided the use of Greek letters in this textbook; with few exceptions,
we are discussing data obtained from actual samples.


Table 3.1
Possible Sources of Variance in a Test Score

LG: Relatively lasting(a) general(b) characteristics of the examinee
LS: Relatively lasting characteristics of the examinee elicited by this
    specific test sample
TS: Temporary characteristics of the examinee elicited by this specific test
    sample
TG: Temporary but general characteristics of examinee (likely to affect his
    performance of any tests given on that occasion, for example, a series of
    tests given on a single occasion for civil service testing or admission
    to college)

Variance due to the LS and TS categories is measurable by comparison of
examinees' scores on different forms or test samples.
Variance due to categories TS and TG is measurable by comparison of
examinees' scores on different testing occasions.

Illustrative Sources of LG, LS, TS and TG Variance

LG (LASTING GENERAL) VARIANCE

1. Ability to respond successfully to stimuli of the type presented in this test
2. Ability to comprehend and follow directions, "testwiseness"
3. General abilities useful in many tests (for example, reading, perceptual
   speed, memory)
4. Attitudes, habits, or emotional reactions that characterize the examinee's
   behavior in situations like the test situation (for example,
   self-confidence, anxiety, tendency to guess when uncertain, tendency to
   give socially approved answers)

LS (LASTING SPECIFIC) VARIANCE

1. Knowledges and skills required in this specific test sample (for example,
   knowledge of how to spell specific words, accuracy with specific number
   combinations)
2. Characteristic examinee attitudes, habits, or emotional reactions elicited by
   this specific test sample (for example, a general tendency to feel tense and
   anxious, "triggered" early in the test by the inclusion of items not covered
   in local course of study)

TS (TEMPORARY SPECIFIC) VARIANCE

1. Fluctuations in memory for particular facts
2. Level of practice, or recenc[…]
3. […] degree of con[…]sponse, or standards of judgment (resulting from
   factors specific to the test sample, such as the examinee's interest in the
   […] science problems included in the sample)
4. Temporary emotional states related to particular test stimuli (for example,
   a question that calls to mind an upsetting disagreement with someone in
   authority)
5. Luck in the selection of answers by "guessing"

TG (TEMPORARY GENERAL) VARIANCE

1. The examinee's condition on that testing occasion, with respect to such factors
   as health and emotional strain
2. Effects of such conditions in the testing environment as heat, light, ventilation
3. Level of motivation, as affected by his perception of purpose of testing, his
   rapport with examiner, and other factors

Source: Adapted from Robert L. Thorndike, Personnel Selection (New York: John Wiley and
Sons, Inc., 1949), p. 73 and Lee Cronbach, "Test Reliability: Its Meaning and Determination,"
Psychometrika, Vol. 12 (January 1947), pp. 1-16.

(a) The term "lasting" refers to consistency from one time to another, or in this context, from
one testing occasion to another.

(b) The term "general" refers to consistency from one sampling of stimuli or content to another;
or in this context, consistency of examinee performance from one "form" of the test to another
equivalent form. When internal-consistency procedures are used to study reliability, the meaning
of the word "form" is stretched to include alternative combinations of items within a single test,
such as odd-numbered items and even-numbered items.

In any set of test scores or other measures, the total variance (V)
includes:

1. a certain amount of valid variance with respect to the ability or trait we
   are attempting to measure
2. a certain amount of invalid variance (due to individual differences in
   testwiseness, cheating, and other systematic errors)
3. a certain amount of error variance, due to compensating errors, that tend to
   average out if repeated samplings of test behavior are obtained.


A person sometimes scores "too high" or "too low" in a specific sample of
behavior, but these inconsistencies average out over the long run. A useful
analogy would be the tendency for a player's daily batting average to be
"too high" or "too low" to represent him fairly because of chance factors;
while in cumulated batting averages these positive and negative "errors"
tend to cancel out, resulting in less error variance and more reliable or
consistent scores.

Error variance of the "compensating error" type arises from two major
sources:

1. Instrument-centered errors resulting from ambiguities in test questions and
   […], and also from the fact that we test only a sample rather than a
   total universe of information and skills.
2. Errors resulting from temporal fluctuations in the individual examinee:
   variations from one testing occasion to another in his attitudes, speed of
   working, and other factors.


If we define an individual's "true score" as the average of an infinite
number of testings with the same instrument,3 we can think of each
person's score on a single testing as equal to his true score plus an error. (The
expression "plus" is used in an algebraic sense with no implication that
errors cannot decrease as well as augment a score.)


Then

$$X = X_{true} + X_{error}$$

where $X_{true}$ represents the "true score," as defined. The average of an
infinite number of testings for an individual would be $X_{true}$, since the
sampling errors would tend to cancel each other out as considerable data were
cumulated.

The total variance in obtained scores for a group of examinees would
equal the sum of the "true variance" and the error variance.

Total variance = true variance + error variance

or

$$SD^2_{total} = SD^2_{true} + SD^2_{error}$$
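A small numerical check can make the additivity concrete. The sketch below uses four hypothetical examinees whose errors were deliberately constructed to average zero and to be unrelated to the true scores; that construction is an assumption of the example, and with real data the relation holds only approximately.

```python
def variance(scores):
    """Mean of the squared deviations from the mean (divisor N)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / n

# Hypothetical true scores and compensating errors for four examinees.
# The errors average zero and are uncorrelated with the true scores.
true_scores = [40, 40, 60, 60]
errors = [-5, 5, -5, 5]
obtained = [t + e for t, e in zip(true_scores, errors)]  # X = X_true + X_error

total_variance = variance(obtained)
true_variance = variance(true_scores)
error_variance = variance(errors)

print(total_variance, true_variance, error_variance)  # 125.0 100.0 25.0
```

Here the total variance (125) equals the true variance (100) plus the error variance (25).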
If it were feasible to test students one hundred or more times (without
changing their performance through practice, boredom, or resistance), each
individual's scores on these many different testings would be distributed
according to a normal frequency distribution, with the mean score from all
these testings being his "true score." The SD of this frequency distribution
of scores on repeated testings would be called standard error4 or SE.

3 The assumption is made that no learnin[g] […]tions. Actually the term
"true score" implies […] instrument. Buros suggests the term "asymptotic
score" for the limiting va[lue] […] as a limiting value and will agree that
"asymptotic score" would be a more appropriate term than "true score."
Oscar K. Buros, "Schematization of Old and New Concepts of Test Reliability
Based on Parametric Models," to appear in 20th Yearbook, National Council
on Measurement in Education. Ames, Iowa: The Council, 1964.

Once we had these standard errors, we could compute a reliability
coefficient from this information. The reliability coefficient for a set of scores
is defined as the proportion of the total variance that is "true" or nonerror
variance.
Expressed in a formula,

$$\text{Rel. coeff.} = \frac{\text{"true" variance}}{\text{total variance}} = \frac{\text{total variance} - \text{error variance}}{\text{total variance}}$$

or

$$\text{Rel. coeff.} = 1 - \frac{\text{error variance}}{\text{total variance}}$$


Let us assume that our estimate of the standard error for a test (that is,
the average standard deviation of repeated measurements for individual
students) is 5 points. The error variance in this set of scores would then be
$SE^2$, or 25. Let us also assume that the standard deviation for the
distribution of scores (for all students in the group) is 15; here the total variance
would be $15^2$ or 225. Then, substituting in the formula given above, we
can obtain the reliability coefficient.

$$\text{Rel. coeff.} = 1 - \frac{\text{error variance}}{\text{total variance}} = 1 - \frac{25}{225} = 1 - .11 = .89$$
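The substitution above can be reproduced with a short Python sketch. The function name is hypothetical; the inputs are the SE of 5 and SD of 15 assumed in the text.

```python
def reliability_coefficient(standard_error, standard_deviation):
    """Proportion of total variance that is not error variance."""
    error_variance = standard_error ** 2        # SE squared
    total_variance = standard_deviation ** 2    # SD squared
    return 1 - error_variance / total_variance

rel = reliability_coefficient(5, 15)
print(round(rel, 2))  # 0.89
```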


As the reader has undoubtedly concluded, it is not feasible to obtain
the value for the error variance by retesting students one hundred times.
Therefore, other methods are used to obtain the reliability coefficient.
Then, when that coefficient is obtained, the SE (standard error) is
computed from a variation of the formula given above.

Since

$$\text{Rel. coeff.} = 1 - \frac{SE^2}{SD^2}$$

we can obtain, by transposing terms:

$$\frac{SE^2}{SD^2} = 1 - \text{Rel. coeff.}$$


4 Actually standard errors can be computed for many types of statistics (for
example, $SE_M$ denotes the standard error of the mean, or the standard deviation of
a distribution of means that would be obtained by taking an infinite number of
samplings from a population). However we will not use a subscript but will
assume that SE refers to the standard error of measurement.



Then, by multiplying both sides of the equation by the denominator, we
obtain:

$$SE^2 = SD^2 (1 - \text{Rel. coeff.})$$

Taking the square root of each side gives the typical formula for the
standard error.

$$SE = SD \sqrt{1 - \text{Rel. coeff.}}$$

This formula is often written $SE = SD \sqrt{1 - r_{11}}$, with $r_{11}$ representing the
correlation coefficient between individuals' scores on one testing and their
scores on another.
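As a sketch of the formula in use, the following Python function (a hypothetical name) recovers the standard error from the SD and the reliability coefficient. Feeding it the unrounded coefficient from the earlier illustration (SD of 15, error variance of 25) returns the SE of 5 that the illustration started from.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SE = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

# Unrounded reliability coefficient from the illustration: 1 - 25/225
reliability = 1 - 25 / 225
sem = standard_error_of_measurement(15, reliability)
print(round(sem, 2))  # 5.0
```

Using the rounded coefficient of .89 instead would give a slightly smaller value, which is simply the effect of rounding.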


COMPUTING CORRELATION COEFFICIENTS 


We must now turn our attention, from our discussion of reliability,
to the correlation coefficient itself. The correlation coefficient is an
exceedingly useful measure of relationship, which expresses in a single decimal
fraction the degree of relationship or "going-togetherness" between two
variables, such as the tendency for persons who are tall to make more
"baskets" in a basketball game, or the tendency for students of higher IQ
to have more extensive vocabularies.5

In many situations throughout this textbook, we will be concerned with
relationships between two variables (for example, scores on two tests for
the same students). Relationships between two sets of data are studied in
order to answer the question: How well can I predict a person's relative
status in one characteristic if I know his status in another characteristic?
For example, we might like to know the answers to such questions as:

[…]

5 The values of the correlation coefficient range in size from .00 […]
[perfe]ct positive or a […] [ex]ample, for a given […] [ra]ther than decrease,
[…]ce workers, there […] [w]ho made the fewest […] and fewness of errors
reflect a composite of […]



In other words, the chief interest of educators in correlation is the practical 
one of knowing how much dependence they can place upon certain types 
of data available to them in predicting other types of data that are helpful 
in their work. 


The Spearman Rank-Difference Method of Computing a Correlation Coefficient 


When one is computing a correlation coefficient for a small group, the 
most practical method is the Spearman Rank-Difference method. More- 
over, when one has only rank-order data, this method is the preferred one. 
The steps in the method are shown in Table 3.2, which illustrates the pro- 
cedure for finding rho, the rank-difference coefficient of correlation. Al- 
though this method is not suitable for computing reliability coefficients, for 
which a larger number of cases is needed, it is included here to help the 
student understand the meaning of a correlation coefficient. 


Table 3.2
Computation of the Coefficient of Correlation
by the Rank-Difference Method

          Grade Placements        Ranks(a)
          ARITH.                  ARITH.
STUDENT   REASONING   READING     REASONING   READING     D       D²
  1         5.5         6.2          8.0         7.5      0.5      0.25
  2         4.0         4.9         25.0        19.0      6.0     36.00
  3         6.0         5.7          5.0        12.0      7.0     49.00
  4         5.4         4.8          9.5        20.0     10.5    110.25
  5         4.8         5.5         16.0        15.5      0.5      0.25
  6         5.2         5.7         12.0        12.0      0.0      0.00
  7         6.2         7.2          2.5         2.0      0.5      0.25
  8         4.7         3.8         20.0        29.0      9.0     81.00
  9         4.8         5.0         16.0        18.0      2.0      4.00
 10         4.6         4.7         22.0        21.0      1.0      1.00
 11         4.8         5.6         16.0        14.0      2.0      4.00
 12         5.2         4.1         12.0        27.5     15.5    240.25
 13         5.9         6.8          6.0         3.0      3.0      9.00
 14         5.4         6.0          9.5        10.0      0.5      0.25
 15         6.8         7.9          1.0         1.0      0.0      0.00
 16         6.2         6.2          2.5         7.5      5.0     25.00
 17         3.9         4.4         27.0        24.0      3.0      9.00
 18         4.8         5.4         16.0        17.0      1.0      1.00
 19         4.5         4.1         23.0        27.5      4.5     20.25
 20         4.8         4.6         16.0        22.5      6.5     42.25
 21         5.2         6.5         12.0         5.0      7.0     49.00
 22         4.0         6.4         25.0         6.0     19.0    361.00
 23         6.1         5.7          4.0        12.0      8.0     64.00
 24         4.0         4.6         25.0        22.5      2.5      6.25
 25         3.7         4.3         28.5        24.5      4.0     16.00
 26         4.7         3.3         20.0        30.0     10.0    100.00
 27         3.7         5.5         28.5        15.5     13.0    169.00
 28         5.6         6.7          7.0         4.0      3.0      9.00
 29         4.7         6.1         20.0         9.0     11.0    121.00
 30         3.5         4.3         30.0        24.5      5.5     30.25

N = 30                                          ΣD² = 1558.50
N² = 900                                     6(ΣD²) = 9351
N² - 1 = 899

$$\rho\ (\text{rho}) = 1 - \frac{6 \sum D^2}{N (N^2 - 1)} = 1 - \frac{9351}{30 (900 - 1)} = 1 - \frac{9351}{26970} = 1 - .347 = .653$$

(a) When grade placements (or scores) are identical, their ranks are averaged. For example, in
arithmetic reasoning, pupils 7 and 16 have identical grade placements of 6.2; hence ranks 2
and 3 are averaged; and a rank of 2.5 is assigned to each pupil.


An examination of the pairs of scores in the first two columns of Table 
3.2 indicates a tendency for higher reading grade placements to be asso- 
ciated with higher grade placements in arithmetic reasoning. This relation- 
ship is more clearly evident when each student is assigned his rank in the 
group, first with respect to arithmetic reasoning, then with respect to 
reading. Comparison of these two columns of ranks reveals that a few 
students have identical ranks on the two tests; a large number of students 
have ranks that agree closely; whereas a small number show marked differ- 
ences in rank. In the fifth column, the difference in ranks (D) is shown 
for each student. In the last column, the D value for each student has
been squared. The sum of the last column (ΣD²) is 1558.5. It can readily
be seen that the closer the agreement between ranks, the smaller this sum,
and the larger the correlation. At the bottom of the table, the steps in the
computation of rho are shown. How to interpret the correlation coefficient
of .653 will be explained in a later chapter section.
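The steps at the bottom of Table 3.2 can be retraced with a short Python sketch. The two lists hold the rank columns as read from the table for students 1 through 30; the ranks of 16 and 22.5 used for student 20 are an assumption, chosen because they follow from that student's grade placements and agree with the printed total of 1558.50. The function name is hypothetical.

```python
# Ranks for students 1-30 as read from Table 3.2 (student 20's ranks are assumed)
arith_ranks = [8, 25, 5, 9.5, 16, 12, 2.5, 20, 16, 22,
               16, 12, 6, 9.5, 1, 2.5, 27, 16, 23, 16,
               12, 25, 4, 25, 28.5, 20, 28.5, 7, 20, 30]
reading_ranks = [7.5, 19, 12, 20, 15.5, 12, 2, 29, 18, 21,
                 14, 27.5, 3, 10, 1, 7.5, 24, 17, 27.5, 22.5,
                 5, 6, 12, 22.5, 24.5, 30, 15.5, 4, 9, 24.5]

def rank_difference_rho(ranks_a, ranks_b):
    """Spearman rank-difference coefficient: 1 - 6*sum(D^2) / (N*(N^2 - 1))."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return sum_d_squared, 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

sum_d_squared, rho = rank_difference_rho(arith_ranks, reading_ranks)
print(sum_d_squared)   # 1558.5
print(round(rho, 3))   # 0.653
```

The sketch takes the ranks as given rather than ranking the grade placements itself, so that it follows the table's own figures.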


The Pearson Product-moment Method of Computing Correlation Coefficients 


There are several different methods of computing coefficients of corre- 
lation, which can be studied in a standard textbook on statistics. Each has 
its special uses. It is beyond the scope of this book to attempt an explana- 
tion of all these methods. In research and test construction, the method of 
computation most frequently used is the Pearson product-moment method, 
which we will now consider.6


THE Z-SCORE FORMULA FOR PEARSON r There are a number of
equivalent formulas for computing the coefficient of correlation, or r, by the
Pearson product-moment method. The formula given below can be used
only when the data for both the x and y variables have been changed into
z-scores.

$$r = \frac{\sum z_x z_y}{N}$$

where $z_x$ stands for the standard scores in the x variable (reading),
$z_y$ stands for the standard scores in the y variable (arithmetic reasoning),
and $\sum z_x z_y$ represents the sum of the products of pairs of standard scores.

In Table 3.3 this formula is applied to the same data previously used in 
the computation of rho. This standard-score formula is seldom used in 
practice, because of the work involved in computing numerous standard 
scores. However, this formula does illustrate the basic principles underlying 
the Pearson product-moment method. The product of each pair of standard 
scores for each student is first obtained. The value of r, as the average of 
all these products, is then computed.

It may be difficult to see why the average of these products is a good
measure of relationship. The examples given below help one to understand.

1. Students 7 and 15 have high positive deviations from the means of both x
   and y; they illustrate a close relationship between arithmetic GP and reading
   GP; their products are large and therefore increase the value of r (the
   average product).
2. Students 25 and 30 show high negative deviations from the means of both
   x and y; they too illustrate a close relationship between arithmetic GP and
   RGP; and since they have large products, increase the value of r.
3. A pair of scores (as for student 22) with an arithmetic GP markedly below
   average and an RGP definitely above average illustrate a negative or inverse
   relationship between the two variables; the product of standard scores in this
   case is negative and decreases the value of r.

6 The use of this method assumes a linear relationship between the two
variables; that is, we assume that a straight line can be drawn that will represent
reasonably well the tallies for pairs of scores (as shown in the scatter-diagram of
Figure A.2 in Appendix D).


Table 3.3
Computation of the Coefficient of Correlation
by the Pearson Product-Moment Method
(standard-score or z-score formula: $r = \sum z_x z_y / N$)

          Grade Placements(a)     Standard or z-Scores(b)
          ARITH.                  ARITH.
          REASONING   READING     REASONING   READING     PRODUCT
STUDENT                              z_y         z_x      z_x z_y
  1         5.5         6.2         + .6        + .7       + .42
  2         4.0         4.9         -1.3        - .5       + .65
  3         6.0         5.7         +1.3        + .3       + .39
  4         5.4         4.8         + .5        - .5       - .25
  5         4.8         5.5         - .3        + .1       - .03
  6         5.2         5.7         + .3        + .3       + .09
  7         6.2         7.2         +1.5        +1.6       +2.40
  8         4.7         3.8         - .4        -1.5       + .60
  9         4.8         5.0         - .3        - .4       + .12
 10         4.6         4.7         - .5        - .6       + .30
 11         4.8         5.6         - .3        + .2       - .06
 12         5.2         4.1         + .3        -1.2       - .36
 13         5.9         6.8         +1.1        +1.3       +1.43
 14         5.4         6.0         + .5        + .5       + .25
 15         6.8         7.9         +2.3        +2.3       +5.29
 16         6.2         6.2         +1.5        + .7       +1.05
 17         3.9         4.4         -1.4        - .9       +1.26
 18         4.8         5.4         - .3          .0         .00
 19         4.5         4.1         - .7        -1.2       + .84
 20         4.8         4.6         - .3        - .7       + .21
 21         5.2         6.5         + .3        +1.0       + .30
 22         4.0         6.4         -1.3        + .9       -1.17
 23         6.1         5.7         +1.4        + .3       + .42
 24         4.0         4.6         -1.3        - .7       + .91
 25         3.7         4.3         -1.6        -1.0       +1.60
 26         4.7         3.3         - .4        -1.9       + .76
 27         3.7         5.5         -1.6        + .1       - .16
 28         5.6         6.7         + .8        +1.2       + .96
 29         4.7         6.1         - .4        + .6       - .24
 30         3.5         4.3         -1.9        -1.0       +1.90

                                            $\sum z_x z_y$ = 19.88

$$r = \frac{\sum z_x z_y}{N} = \frac{19.88}{30} = .66$$

(a) Reproduced from Table 3.2.

(b) The $z_y$, or standard-score values, in the fourth column were obtained by translating each
arithmetic reasoning GP into a standard score by use of the formula $z = \frac{X - M}{SD}$, in which X
stands for the original grade placement, M for the mean GP of 5.0, and SD for the standard
deviation of 0.8. The $z_x$, or standard-score values, in the fifth column were obtained by
translating each RGP value into a standard score by means of the same formula, in which X stands
for the original RGP, M for the mean RGP of 5.4, and SD for the standard deviation for RGP's
of 1.1.

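The standard-score formula can also be sketched in Python. The grade placements below are as read from Tables 3.2 and 3.3 for students 1 through 30; the function names are hypothetical. Because the sketch carries the means, SD's, and z-scores at full precision instead of rounding them to one decimal place, its result may differ slightly from the hand-computed value of .66.

```python
import math

# Grade placements for students 1-30, as read from Tables 3.2 and 3.3
arith = [5.5, 4.0, 6.0, 5.4, 4.8, 5.2, 6.2, 4.7, 4.8, 4.6,
         4.8, 5.2, 5.9, 5.4, 6.8, 6.2, 3.9, 4.8, 4.5, 4.8,
         5.2, 4.0, 6.1, 4.0, 3.7, 4.7, 3.7, 5.6, 4.7, 3.5]
reading = [6.2, 4.9, 5.7, 4.8, 5.5, 5.7, 7.2, 3.8, 5.0, 4.7,
           5.6, 4.1, 6.8, 6.0, 7.9, 6.2, 4.4, 5.4, 4.1, 4.6,
           6.5, 6.4, 5.7, 4.6, 4.3, 3.3, 5.5, 6.7, 6.1, 4.3]

def z_scores(values):
    """Convert raw values to z-scores: z = (X - M) / SD, with divisor N in the SD."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def pearson_r(x, y):
    """Average product of paired z-scores: r = sum(z_x * z_y) / N."""
    z_x, z_y = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(z_x, z_y)) / len(x)

r = pearson_r(reading, arith)
print(round(r, 2))
```

Pairs such as students 7 and 15 contribute large positive products to the sum, while a pair such as student 22 contributes a negative one.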


TABULATING DATA IN A SCATTER-DIAGRAM Ordinarily, the first step in
the Pearson method is to tabulate pairs of scores in a scatter-diagram,
similar to those in Figure 3.1. Class intervals are set up according to the
same principles used in setting up intervals for frequency distributions. In
preparing a scatter-diagram for use in correlation, a tally is entered for
each pair of individual scores, and the number of tallies in each square or
cell is totaled.

In preparing Figure A.2 in Appendix D, 500 pairs of scores were tallied
to obtain a reliability coefficient by the subdivided-test or split-halves
method.7 That is, for each student a tally was entered in the square
corresponding to his score on the even-numbered questions (the y-variable) and
his score on the odd-numbered questions (the x-variable). Scores are from
the 100-item state history test.
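The tallying step can be sketched as follows. The pairs of half-test scores and the interval width of 5 are hypothetical, and each cell is labeled by the lower limits of its x and y class intervals; this is only one convenient way of representing a scatter-diagram.

```python
from collections import Counter

def tally_pairs(pairs, interval_width):
    """Count the pairs of scores falling in each cell of a scatter-diagram.

    Each cell is keyed by the lower limits of its x and y class intervals.
    """
    cells = Counter()
    for x, y in pairs:
        x_lower = x // interval_width * interval_width
        y_lower = y // interval_width * interval_width
        cells[(x_lower, y_lower)] += 1
    return cells

# Hypothetical (odd-numbered score, even-numbered score) pairs for five students
pairs = [(42, 40), (44, 41), (38, 35), (47, 44), (31, 33)]
cells = tally_pairs(pairs, 5)
for cell, count in sorted(cells.items()):
    print(cell, count)
```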


Interpretation of Correlation Coefficients 


Before we interpret reliability coefficients, we will consider the more
general problem of what a correlation coefficient means in terms of
accuracy of prediction.


INTERPRETATION OF r IN TERMS OF SLOPE OF THE PREDICTION LINE
AND THE AMOUNT OF REGRESSION OF PREDICTED SCORES TOWARD THE
MEAN A study of the relationship between two variables is usually made
to ascertain how accurately we can predict students' scores on one variable
from our knowledge of their scores on another. For example, we might
predict a student's score on an algebra test from his score on a test in
arithmetic. If there were no relationship between the two variables and we had
no information to go on, our safest guess would be to predict that each
student would make an average algebra score. In such a situation,
information about arithmetic test scores would be of no help. On the other hand,
if there were a perfect correlation of 1.00 between these two variables,
the error of prediction would be reduced to zero, and each student's rank
on the algebra test would be the same as his rank on the arithmetic test.


Our prediction equation (in z-score form) is:

$$\tilde{z}_y = r z_x$$

with $\tilde{z}_y$ standing for the predicted standard score in the y variable. When r
is 0, $\tilde{z}_y = (0) z_x$; the best prediction for any student is a z-score of 0, or
$M_y$. If r is 1.00, $\tilde{z}_y = (1) z_x$, and the predicted $z_y$ for each student is the
same as his $z_x$.

7 This method is explained on page 86.

[Figure 3.1: scatter-diagrams, each with columns and rows numbered 1 to 9, in
panels labeled "Correlation of .00," "Correlation of .40-.60,"
"Correlation of .70-.80," and "Correlation of .90-.95."]

Fig. 3.1 Illustrative Scatter-Diagrams Yielding Correlation Coefficients
of Various Sizes.

Mathematics majors will realize that r, or $\tilde{z}_y / z_x$, is the slope of the line of
"best fit" (that is, the line that most accurately summarizes the relationship
between the two variables). The equation of this prediction line ($\tilde{z}_y = r z_x$)
can be used in predicting the value of $z_y$ for any student if we know his $z_x$.
The closer the relationship between x and y, the closer the line of "best fit"
approaches a 45° angle with the x-axis, and the closer r (the slope of the
prediction line) approaches 1.00.


[Figure 3.2: graph of $z_y$ (vertical axis) against $z_x$ (horizontal axis), showing
the prediction (regression) line for r of approximately 1.00 and the prediction
(regression) line for r of approximately 0.10, with predicted $z_y$ approaching
0.0 ($M_y$).]

Fig. 3.2 Prediction (Regression) Lines for Perfect Correlation and
Extremely Low Correlation.


If we tallied all pairs of z-scores, we could draw this line of best fit by
first plotting the average scores of the columns,8 and then drawing a line
by visual inspection that best fits this set of points. The slope of this
prediction line would give us a rough estimate of the value of r (Fig. 3.2).

The lower the relationship between x and y, the closer the line of best
fit approaches a 0° angle with the x axis, and the closer r (the slope of
the prediction line) approaches 0. Obviously, r is the slope of the
prediction line only if the x and y variables are in comparable units, for example,
z-scores.

We could substitute any other value of r that we might obtain in this
prediction equation; for example, if r is .70, $\tilde{z}_y = .7 z_x$, while if r is only
.30, $\tilde{z}_y = .3 z_x$. One can see why the prediction line has come to be known
as the regression line, since r is a measure of the extent to which predicted
scores regress toward the mean; the lower the correlation, the greater the
regression.

8 If the averages lie on a curve, rather than a straight line, a different method
of computing the correlation coefficient should be used since Pearson r would
underestimate the degree of relationship.
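A brief Python sketch shows the regression effect numerically. The function name and the student's standing of two standard deviations above the mean are hypothetical; the values of r are those mentioned in the text.

```python
def predicted_z(r, z_x):
    """Predicted standard score in y from a standard score in x."""
    return r * z_x

# A hypothetical student two standard deviations above the mean in x
z_x = 2.0
for r in (1.0, 0.7, 0.3, 0.0):
    print(r, round(predicted_z(r, z_x), 2))
```

As r falls from 1.00 to 0, the predicted standing in y moves from 2.0 standard deviations above the mean down to the mean itself.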


INTERPRETATION OF r² IN TERMS OF PERCENT OF VARIANCE EXPLAINED
AND THE STANDARD ERROR OF ESTIMATE On the scatter-diagrams of
Figure 3.1 which yield a high r, the tallies cluster closely around the
prediction line, the small amount of scatter of tallies around the prediction
line indicating a closer relationship and relatively greater accuracy of
prediction. When r is very low, the tallies are scattered throughout the columns
in such a way that it is obvious that no sound basis exists for predicting y
from x.


If we could obtain an approximate measure of the amount of scatter or
variability of scores around this prediction line, it would represent our
error of prediction or estimate. We could compute the SD for each column,
which reflects the undependability of predicting y from x; then we could
obtain a weighted average9 of these column SD's. This average would be a
fair approximation of the standard error of predicting y from x ($SE_{y.x}$).

The higher the ratio of the "error of prediction" variance $(SE_{y.x})^2$ to
the total variance $(SD_y)^2$, the lower the relationship between the two
variables. Conversely, the lower the ratio of $SE^2_{y.x}$ to $SD^2_y$, the closer the
relationship. $SE^2_{y.x}$ represents the "unpredictable" variance.

$$\text{Proportion of "unpredictable" variance} = \frac{SE^2_{y.x}}{SD^2_y}$$

But $r^2$ represents the proportion of variance in one variable that is
predictable from the other. Therefore

$$r^2 = 1 - \text{proportion of unpredictable variance} = 1 - \frac{SE^2_{y.x}}{SD^2_y}$$


If we wished, we could estimate r by (1) computing the proportion of
variance that is “unpredictable” or due to errors of estimate, and then (2)
subtracting this ratio from 1.00 to obtain the proportion of predictable
variance or r². Actually, r is usually computed by the Pearson
product-moment method and the standard error of estimate is obtained by the
following variation of the formula given above:

\[ SE_{y.x} = SD_y \sqrt{1 - r^2} \]

9 Weighted in terms of the number of cases in each column.

When r is .93 (as in Figure A.2), r² = .86. Hence, we can say that 86
percent of the variance in the even-numbered scores is predictable from 
knowledge of the odd-numbered scores. If all examinees had identical odd- 
numbered scores, the variance in the even-numbered scores would be re- 
duced by 86 percent; only 14 percent of the variance, attributable to other 
factors, would remain. 
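
The two quantities discussed above, the proportion of predictable variance and the standard error of estimate, can be computed directly from r and the SD of the predicted variable. The sketch below is illustrative only; the function names and the SD of 6.0 are assumptions chosen for the example, since the text does not give the SD of the even-numbered half-scores.

```python
import math


def proportion_predictable(r: float) -> float:
    """Proportion of variance in one variable predictable from the other (r squared)."""
    return r ** 2


def standard_error_of_estimate(sd_y: float, r: float) -> float:
    """SE of estimate: SD of y multiplied by the square root of (1 - r squared)."""
    return sd_y * math.sqrt(1.0 - r ** 2)


if __name__ == "__main__":
    r = 0.93     # correlation between odd- and even-numbered scores, as in the text
    sd_y = 6.0   # hypothetical SD of the even-numbered scores
    print(f"predictable variance:   {proportion_predictable(r):.2f}")
    print(f"unpredictable variance: {1 - proportion_predictable(r):.2f}")
    print(f"SE of estimate:         {standard_error_of_estimate(sd_y, r):.2f}")
```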

After this long “detour” to consider the computation and interpretation 
of correlation coefficients, we can now return to the subject of reliability. 
The student should review the concept of, and the formula for, the stand- 
ard error (SE) of a test score before studying the next chapter section. 


COMPARISON OF STANDARD ERRORS AND RELIABILITY 
COEFFICIENTS AS MEASURES OF RELIABILITY 


The concept of reliability is perhaps most easily understood in terms of the 
standard error. One can readily see how the standard deviation of a series 
of scores on repeated tests for the same individual reflects (1) the con- 
sistency of the test as a measuring instrument and (2) the fluctuations with 
time in certain characteristics of the examinee that affect his test scores. 
Use of the standard error also helps us to think of a student's test score as 
representing a range of probable scores. That is, if the SE for a specific 
intelligence test is 5 points, we can infer that the chances are two out of 
three that scores obtained by students differ from their “true scores" by 
less than 5 points. These “odds” are based on the fact that two-thirds of 
the scores lie within one SD of the mean (in this case, the individual's 
theoretical true score). 
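
The "two out of three" odds follow from the normal curve, and a short sketch makes the arithmetic explicit. It assumes that errors of measurement are normally distributed; the function names and the obtained score of 110 are hypothetical.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normally distributed score lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2.0))


def score_band(obtained: float, se: float, k: float = 1.0):
    """Range extending k standard errors on either side of an obtained score."""
    return obtained - k * se, obtained + k * se


if __name__ == "__main__":
    se = 5.0  # standard error of the intelligence test in the example above
    low, high = score_band(110.0, se)  # 110 is a hypothetical obtained score
    print(f"chance of falling within one SE: {prob_within(1.0):.2f}")
    print(f"band of probable scores: {low:.0f} to {high:.0f}")
```

The probability for one SE is about .68, which the text rounds to two chances in three.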

Although the standard error is quite valuable in the interpretation of in- 
dividual test scores, reliability coefficients are preferred for comparing the 
consistency of measurement of different tests (for example, our locally 
developed arithmetic, spelling, and history tests). Standard errors are not 
comparable from one test to another; that is, the SE varies in size with 
the number of items on the test, as well as the SD of the test scores. When
one wishes to compare the reliability of two or more tests, the reliability
coefficients usually provide a better basis for comparison than do the
standard errors.¹⁰


10 However, the standard errors of two tests both utilizing T-scaled scores (or 
some other type of standard score) would provide just as good a basis as the 
reliability coefficients for comparing the tests with respect to consistency of
measurement. When T-scaled scores are used, the SD's for both tests would be 10;
and the SE's would therefore be comparable.


METHODS OF ESTIMATING RELIABILITY OF TEST SCORES 


Reliability coefficients can be obtained in several different ways. When one 
examines Table 3.4, one realizes that there can be no single reliability co- 
efficient for a test. Moreover, since standard errors are based on reliability 
coefficients, there is similarly no such thing as the standard error for a test.
A standard error reflects the types of error variance that are measured by
the reliability coefficient on which it is based. The four most frequently 
used approaches will be considered in turn. 


Table 3.4
Comparison of Different Types of Reliability Coefficients with
Respect to Types of Error Variance Taken into Account

                                                          TYPES OF SCORE VARIANCEᵃ COUNTED AS
TYPE OF COEFFICIENT       METHOD USED                     True variance           Error variance

TEST-RETEST METHOD
Coefficient of stability  Same test sample;               LG, LS                  TG, TS
(consistency over time    different occasions             Lasting general and     Temporary general and
on same content)                                          lasting specific        temporary specific

EQUIVALENT-FORMS METHOD
Coefficient of            Same occasion; different        LG, TG                  LS, TS
equivalence (consistency  test samples (that is,          Lasting general and     Lasting specific and
in performance on         parallel forms administered     temporary general       temporary specific
specific content          at essentially the same
samples)                  time)

INTERNAL-CONSISTENCY METHODS
Coefficient of internal   Internal analysis of data       LG, TG                  LS, TS
consistency               (same occasion, different       Lasting general and     Lasting specific and
(approximation of         samplings of items from same    temporary general       temporary specific
coefficient of            test; that is, subdivided-test
equivalence)              or Kuder-Richardson method)ᵇ

EQUIVALENT-FORMS METHOD WITH TIME INTERVAL
Coefficient of            Different occasions;            LG                      LS, TG, TS
equivalence and           different tests (that is,       Lasting general         Lasting specific,
stability (consistency    administration of parallel                              temporary general, and
over time and over        forms with intervening                                  temporary specific
specific content          time interval)
samples)

ᵃ For examples of factors contributing to LG (lasting general), LS (lasting specific), TS
(temporary specific), and TG (temporary general), the reader is referred to Table 3.1.

ᵇ Although the Kuder-Richardson formula measures consistency of examinee performance on
all items in the test, use of the Kuder-Richardson Formula 20 results in a reliability coefficient
which approximates the average of all the split-half coefficients which would be obtained on all
possible divisions of the test into equivalent halves. The Kuder-Richardson Formula 20 is
computed from data concerning the proportion of examinees passing each test item and the SD
of test scores. See J. P. Guilford, Fundamental Statistics in Psychology and Education (New
York: McGraw-Hill Book Company, Inc., 1956), pp. 454-455.


The Test-Retest Method 


When reliability is measured by the test-retest method, a coefficient of 
stability is obtained. This reliability coefficient measures error variance due 
to temporal variations in characteristics of the examinee, as well as varia- 
tion in conditions of test administration. Some of this temporal instability 
in test scores is due to variations from one testing occasion to another in 
the examinee's general characteristics, such as his health or emotional
tension; part of it is due to variations in his reactions to the specific test.
Illustrations of these sources of variance are listed in Table 3.1 under the 
headings TG and TS respectively. In other words, when the test-retest 
method is used, a coefficient of stability is obtained, which reflects only the 
TG and TS types of error variance (that is, variations in examinee test 
performance from one testing occasion to another). 

When the test-retest method is used, the interval between tests should be 
at least several days so that the student's memory of his answers does not 
spuriously increase the consistency of scores. However, the time interval 
should not exceed two or three weeks because we are trying to measure 
stability of student performance on the test, rather than the stability of the
interest, ability, or personality trait measured.¹¹


11 A correlation coefficient, based on two testings between which opportunity
for learning and/or maturation has occurred, is a useful statistic, especially for a
counselor who is using present test data on aptitudes or interests as a basis for
inferences about future performance. However, such a coefficient should not be
interpreted as simply estimating reliability, or error variance; it requires a more
complex interpretation.

The test-retest method is infrequently used. Practice effects, which are
not the same for all subjects, interfere with our attempt to measure the test's
consistency. Moreover, students may be unwilling or unable to retake the
test with the same level of motivation. Perhaps even more serious is the
fact that the test-retest method fails to measure the types of error variance
listed under LS, which result from the fact that a specific test includes
only a sampling of content from the area that the test is designed to
represent.


The Equivalent-Forms Method


Many standardized tests have two or more equivalent forms that have
been designed to be comparable in content, length, difficulty level, and
variance. When two equivalent forms (say forms A and B) are administered
to students on the same occasion,¹² a coefficient of equivalence is
obtained, which measures the consistency of examinee performance from
one specific sampling of test content to another. With this procedure, error
variance due to TS and LS is measured, that is, temporary and lasting
characteristics of the examinee that are elicited by this specific sampling of
test items. This method does not take into account temporal fluctuations
in examinee performance.

12 In order to equate the effects of practice on student achievement on the two
forms, it is good procedure to have half the students take form A and then form
B; and to have the other half take form B first, followed by form A.


The Internal-Consistency Methods


Sometimes equivalent forms of a test are not available; sometimes it is
difficult to obtain permission to have students take both forms of a test.
For these reasons, internal consistency methods of measuring reliability
have become popular. Such methods involve a comparison of examinee
performance on different samplings of items from the same test. They
provide satisfactory estimates of the coefficient of equivalence.

A frequently used “internal consistency” method of estimating test
reliability is the subdivided-test method, often called the split-halves method
or the odd-even method. The last name arises from the fact that for most
tests, the efficient way of computing such a reliability coefficient is to score
the odd-numbered items as one “form,” the even-numbered items as another
“form,” and then to correlate students’ scores on the two halves of
the test.¹³

The subdivided-test method, like the equivalent-forms method, takes 
into account variance due to the specificity of the tests and fails to measure 
temporal instability in test performance of students. In Figure A.2, the 
reliability coefficient has been computed by this method for the state history 
test. The reliability coefficient is .93. Correction with the Spearman-Brown 
formula would give a corrected coefficient¹⁴ of .96.
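
The odd-even procedure and the Spearman-Brown correction described above can be sketched as follows. The helper names and the small response matrix are hypothetical (1 = item answered correctly, 0 = incorrectly); only the uncorrected coefficient of .93 comes from the text.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def odd_even_scores(item_scores):
    """Score the odd-numbered and even-numbered items of one examinee separately."""
    odd = sum(item_scores[0::2])   # items 1, 3, 5, ...
    even = sum(item_scores[1::2])  # items 2, 4, 6, ...
    return odd, even


def spearman_brown(r_half, factor=2.0):
    """Estimated reliability of a test `factor` times as long as the half-test."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown corrected coefficient)."""
    halves = [odd_even_scores(row) for row in responses]
    r_half = pearson_r([h[0] for h in halves], [h[1] for h in halves])
    return r_half, spearman_brown(r_half)


if __name__ == "__main__":
    responses = [  # hypothetical data: five examinees, six items
        [1, 1, 1, 1, 1, 1],
        [1, 1, 1, 0, 1, 0],
        [1, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    r_half, r_full = split_half_reliability(responses)
    print(f"half-test r = {r_half:.2f}, corrected = {r_full:.2f}")
    print(f"state history test: .93 corrected to {spearman_brown(0.93):.2f}")
```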

The Kuder-Richardson method, like the subdivided-test method, is based
on consistency in the student's test performance on different items.
However, while the subdivided-test method compares students’ scores on two
halves of the test, the Kuder-Richardson method involves a study of
interitem consistency. With a relatively simple approximation formula, known
as the Kuder-Richardson Formula 21, reliability coefficients can be
computed quite easily, just on the basis of the mean (M), the number of items
(n), and the standard deviation (SD):

\[ r_{KR21} = \frac{n}{n-1}\left(1 - \frac{M(n-M)}{n(SD)^2}\right) \]

For example, on the state history test, where n is 100, the mean is 78, and
the SD is approximately 10, the computation is as follows:

\[ r_{KR21} = \frac{100}{99}\left(1 - \frac{78 \times 22}{100 \times (10)^2}\right) = 1.01\left(1 - \frac{1716}{10{,}000}\right) = 1.01(1 - .17) = 1.01(.83) = .84 \]
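
The same computation can be written as a short function. This is a minimal sketch; the name `kr21` is an assumption, and the inputs are the state history test values used above.

```python
def kr21(n_items: float, mean: float, sd: float) -> float:
    """Kuder-Richardson Formula 21 from the number of items, the mean, and the SD."""
    return (n_items / (n_items - 1)) * (1 - mean * (n_items - mean) / (n_items * sd ** 2))


if __name__ == "__main__":
    # State history test: n = 100, M = 78, SD approximately 10.
    print(f"r_KR21 = {kr21(100, 78, 10):.2f}")
```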


13 It would not be satisfactory to score items in the first half of the test as one
“form,” with the remaining items constituting the other “form.” Because easier
items are included in the first half of the test, such a procedure would not result
in “forms” of equal difficulty. Moreover, they would be dissimilar with respect
to content.

14 When the reliability coefficient is computed in this way, a correction must
be made for the fact that each of these halves is only half as long as the original
test. Since a larger sampling of items results in greater consistency of measurement,
the Spearman-Brown correction is applied to estimate the reliability coefficient
that would be obtained for a test twice the length of the odd- or even-numbered
“form.” The table for Spearman-Brown corrections is in Appendix D (Table A.2).
However, the following formula is superior in that it is not based on the
assumption (made in deriving the Spearman-Brown formula) that the SD's of the two
half-scores are equal:

\[ r = 2\left(1 - \frac{SD_a^2 + SD_b^2}{SD_t^2}\right) \]

where $SD_a$ and $SD_b$ are the standard deviations of scores on the two half-tests and
$SD_t$ is the standard deviation of scores on the total test.


When the Kuder-Richardson Formula 20¹⁵ is used, a reliability coefficient
is obtained which approximates the average of all split-half coefficients which
would be obtained on all possible divisions of the test into equivalent
halves. For most tests, however, the simpler Formula 21 will give nearly
the same results.

For a still easier method of approximating a Kuder-Richardson
reliability coefficient, the reader is referred to Table 3.5. To illustrate the use
of this table, we will determine the reliability of the state history test. We
need to know only that the test contains 100 items, that the average score
is 78 percent, and that the SD is 1/10 as large as the number of items, or
equal to .10n. Using the first table (A) for comparatively easy tests, we
obtain a reliability coefficient of .85, which agrees very closely with the
coefficient computed by Formula 21.
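
Formula 20, mentioned above, works from item-level data rather than from the mean alone. The following is a hedged sketch of its standard form, n/(n - 1) times (1 minus the sum of pq over the variance of total scores), where p is the proportion passing an item and q = 1 - p. The function name, the use of the population variance, and the tiny response matrix are assumptions made for illustration.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores (rows = examinees)."""
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pvariance(totals)  # population variance of total scores (an assumption)
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion passing item j
        sum_pq += p * (1.0 - p)                          # p times q for item j
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


if __name__ == "__main__":
    responses = [  # hypothetical data: five examinees, four items
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(f"r_KR20 = {kr20(responses):.2f}")
```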

From this same table we could also obtain the Kuder-Richardson re- 
liability coefficient for the arithmetic test from the data given in Table 2.3. 
This test is also an easy test, with a M of 84 out of 100 items. The 
reliability coefficient would be approximately .85 if the SD were .10n and 
.94 if the SD were .15n. Since the SD is .11n, which lies one-fifth of the way
between the two values heading the columns, we can interpolate and
obtain a reliability coefficient of .85 + 1/5 (.09) or .85 + .02 or .87. Thus,
we see that these two locally developed tests are almost equally reliable 
when their reliability is measured by the Kuder-Richardson method. 
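
The interpolation between table columns is ordinary linear interpolation. A minimal sketch follows; the function name `interpolate` is hypothetical, and the inputs are the two column headings and table entries cited above.

```python
def interpolate(x: float, x0: float, y0: float, x1: float, y1: float) -> float:
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


if __name__ == "__main__":
    # Arithmetic test: SD = .11n lies between the columns headed .10n and .15n.
    r = interpolate(0.11, 0.10, 0.85, 0.15, 0.94)
    print(f"interpolated reliability = {r:.2f}")
```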


Table 3.5
Approximate Kuder-Richardson Reliability Coefficients


A. Reliability Coefficients for Comparatively Easy Tests
(Average Score: 70% to 90% correct)

                      Reliability Coefficient When SD Is
NO. OF ITEMS (n)      .10n      .15n      .20n
      100             .85       .94       .97
       90             .83       .93       .97
       80             .81       .92       .96
       70             .78       .91       .96
       60             .75       .90       .95
       50             .69       .88       .94
       40             .62       .84       .92
       30             .48       .80       .90
       20             .21       .68       .84


15 This formula requires the computation of the proportion of students passing
and failing each item.


B. Reliability Coefficients for Comparatively Difficult Tests
(Average Score: 50% to 70% correct)

                      Reliability Coefficient When SD Is
NO. OF ITEMS (n)      .10n      .15n      .20n
      100             .77       .90       .95
       90             .74       .89       .94
       80             .71       .88       .94
       70             .66       .86       .93
       60             .61       .84       .92
       50             .53       .80       .90
       40             .41       .75       .87
       30             .21       .57       .83
       20                       .49       .74

Source: Adapted from tables in Paul Diederich, Short-Cut Statistics for Teacher-Made Tests,
Evaluation and Advisory Service Series No. 5 (Princeton, N. J.: Educational Testing Service,
1960), p. 29.


Table 3.6
Standard Error of Measurement for Different Values of the Reliability
Coefficient and the Standard Deviation of Test Scoresᵃ

RELIABILITY                  SE When Standard Deviationᵇ Is
COEFFICIENT      1     2     3     4     5     6     7     8     9    10
    .98         0.1   0.3   0.4   0.6   0.7   0.8   1.0   1.1   1.3   1.4
    .95         0.2   0.4   0.7   0.9   1.1   1.3   1.6   1.8   2.0   2.2
    .90         0.3   0.6   0.9   1.3   1.6   1.9   2.2   2.5   2.8   3.2
    .85         0.4   0.8   1.1   1.5   1.9   2.3   2.7   3.1   3.5   3.9
    .80         0.4   0.9   1.3   1.8   2.2   2.7   3.1   3.6   4.0   4.5
    .75         0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0
    .70         0.5   1.1   1.6   2.2   2.7   3.3   3.8   4.4   4.9   5.5
    .65         0.6   1.2   1.8   2.4   3.0   3.6   4.1   4.7   5.3   6.0
    .60         0.6   1.3   1.9   2.5   3.2   3.8   4.4   5.1   5.7   6.3

ᵃ The standard error of measurement is computed by multiplying the standard deviation of
test scores by the radical $\sqrt{1 - \text{rel. coefficient}}$. In other words,

\[ SE = SD\sqrt{1 - r} \]

ᵇ The column headed 10 may be used whenever T-scores are involved. If the standard
deviation of test scores is higher than 10, for example, 12 (as in the state history test), one can
add the appropriate entries in the columns for standard deviations of 10 and 2 (or other
appropriate combinations). In this case the SD is 12, the estimated reliability coefficient (from
Table 3.5) is .85; hence the SE is 3.9 + .8 or 4.7. There are two chances in three that a
student's obtained score in the state history test differs from his “true score” by not more than
4.7 points. A student's "true score" is a theoretical score: the average of all scores obtained
on an infinite number of retestings.


By the use of Table 3.6, we find that for tests with a reliability co- 
efficient of approximately .85, the standard error of a T-score for both the 
arithmetic and history tests (SD = 10) is 3.9, or approximately four 
score points. 
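
The entries of Table 3.6 follow from the formula SE = SD√(1 − r), so the table lookup can be checked directly. The sketch below is illustrative; the function name is an assumption.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SE of measurement: SD of test scores times the square root of (1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    # T-scores (SD = 10) with a reliability coefficient of about .85.
    print(f"SE for SD 10: {standard_error_of_measurement(10, 0.85):.1f}")
    # Raw scores on the state history test (SD = 12), same reliability.
    print(f"SE for SD 12: {standard_error_of_measurement(12, 0.85):.1f}")
```

Because the SE is proportional to the SD, adding the table entries for SD's of 10 and 2 gives the same result as computing the SE for an SD of 12, apart from rounding.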


The Method Involving Equivalent Forms Administered with a Time 
Interval between Testings 


The reader will note from Table 3.4 that the administration of equiva- 
lent forms, with an interval of time between the two administrations, 
measures all the major types of error variance. Certainly, if the test user 
intends to measure student growth during some time interval by admin- 
istering equivalent forms of a test on different occasions, this type of 
reliability coefficient that measures both equivalence and stability is the 
one which should be computed. Naturally, reliability coefficients computed 
by this method will tend to be lower than those computed by any other 
method because more types of error variance are taken into account. 


Comparison of Methods 


Some measure of the stability of scores from one testing occasion to 
another should be given in any test manual. Ideally, the last method (which 
yields a coefficient of equivalence and stability) should be used; this 
method minimizes recall of specific answers and avoids other difficulties 
mentioned under the discussion of the test-retest method.

The internal consistency methods help one to estimate how much of 
the variance in test scores is due to lack of equivalence between different 
samplings of items. The Kuder-Richardson method is better than the 
subdivided-test method in that it provides an estimate of the average co- 
efficient that could be obtained if all possible subdivisions of the test 
were utilized. 

The Kuder-Richardson method measures equivalence through a study 
of interitem consistency in student performance. If the universe of items 
we are sampling is fairly homogeneous, student performance will be fairly 
consistent from item to item. If the universe is very homogeneous, inter- 
item consistency will be unusually high; only a relatively small sample 
will be needed as a basis for inferences about student performance on the 
universe of possible items. 

This relationship of reliability to homogeneity of content is analogous 
to the situation we find in laboratory analysis of blood where a very small 
sample of the homogeneous "universe" being measured serves as a reliable 
basis for many important inferences. No dentist, however, would make 
inferences about the condition of one's teeth from the examination of two
or three teeth, even though such a sampling constitutes a much larger
proportion of the whole than the sampling of blood that the laboratory
technician obtained. The more homogeneous the content sampled, the more
consistent the results from sample to sample.


Table 3.7
Reliability Coefficients for Four Subtests of the SRA Primary
Mental Abilities for Ages 11 to 17

                          Reliability Coefficient Found by
                          SPLIT-HALF       SEPARATELY-TIMED
SUBTEST                   METHOD           HALVES
Verbal Reasoning            .94              .90
Reasoning                   .96              .87
Space                       .90              .75
Number                      .92              .83

Source: Anne Anastasi and J. D. Drake, "An Empirical Comparison of Certain Techniques for
Estimating the Reliability of Speeded Tests,” Educational and Psychological Measurement, vol.
14 (Autumn 1954), pp. 529-540.


Neither the Kuder-Richardson nor the subdivided-test method should
be used if tests are highly speeded.¹⁶ The way in which an internal-consistency
method can spuriously inflate the reliability coefficients of
speeded tests is shown in Table 3.7. The authors estimated the extent to
which each subtest of the Primary Mental Abilities Test was speeded.¹⁷
Their findings indicated that Verbal Reasoning scores were least affected by
individual differences in working speed; that scores in the Reasoning test
were somewhat affected by speed; and that the Space and Number tests
were highly speeded. As Table 3.7 shows, the reliability coefficients of the
relatively unspeeded verbal reasoning test were approximately the same
when the conventional split-halves method was used, as compared with
the separately-timed-halves method. For the other tests, the conventional
split-halves method gave an inflated estimate of reliability.

16 On a highly speeded test, students would tend to get similar scores on chance-
halves of the test simply because their speed of working in a specific testing session
would have enabled them to cover similar numbers of items on the two halves of the
test. If the odd-numbered items and even-numbered items were typed or printed
as separate forms and administered under separate time limits, the split-halves
method would be acceptable. A test should be considered a speeded test if there
is considerable variation with respect to the number of items omitted at the end
of the test, or if there is a low correlation between scores earned on the test when
administered with and without time limits.

17 The method used was to find the variance with respect to number of items
completed by students and then divide this variance by the total variance of test
scores.
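
The speededness index described in footnote 17 is a simple ratio of two variances. The sketch below shows one way it might be computed; the function name and the sample data are hypothetical, and the population variance is used for both terms as an assumption.

```python
from statistics import pvariance


def speededness_ratio(items_completed, scores):
    """Variance in number of items completed divided by the variance of test scores."""
    return pvariance(items_completed) / pvariance(scores)


if __name__ == "__main__":
    items_completed = [10, 12, 14]  # hypothetical items attempted by three students
    scores = [6, 12, 18]            # hypothetical total scores for the same students
    print(f"speededness ratio = {speededness_ratio(items_completed, scores):.2f}")
```

The larger the ratio, the more the spread of scores reflects differences in how far students got through the test.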


RELIABILITY OF DIFFERENCE SCORES 


When we compare the scores of students in two tests (for example, the 
arithmetic and history tests), we wish to know whether the differences 
are largely attributable to errors of measurement or whether we would be 
likely to find similar intraindividual differences if we retested. That is, we 
are concerned with the reliability of “difference scores.” Unfortunately, 
the reliability of differences between pairs of scores is much less than the 
reliability of either score. Two factors are responsible for the lower relia- 
bility of difference scores: (1) the errors of measurement in both tests 
affect the error variance of the difference; and (2) whatever is common 
to both tests (arithmetic and state history) is canceled out in the
computation of the difference. If two tests overlap considerably with respect
to abilities measured, a considerable portion of the consistent variance in
each score is due to the overlapping part. When that variance is
subtracted, as it is in obtaining difference scores, the remaining test variance
contains a larger proportion of error.


Table 3.8
Reliability of Differencesᵃ between Standard Scores

CORRELATION       RELIABILITY COEFFICIENT FOR DIFFERENCE SCORES
BETWEEN           WHEN AVERAGE RELIABILITY COEFFICIENT
TWO TESTS         OF TWO TESTS IS
                  .70     .75     .80     .85     .90     .95
   .00            .70     .75     .80     .85     .90     .95
   .10            .67     .72     .78     .83     .89     .94
   .20            .63     .69     .75     .81     .88     .94
   .30            .57     .64     .71     .79     .86     .93
   .40            .50     .58     .67     .75     .83     .91
   .50            .40     .50     .60     .70     .80     .90
   .60            .25     .38     .50     .62     .75     .88
   .70            .00     .17     .33     .50     .67     .83
   .75                    .00     .20     .40     .60     .80
   .80                            .00     .25     .50     .75
   .85                                    .00     .33     .67
   .90                                            .00     .50
   .95                                                    .00

ᵃ Computations are based on the following formula:

\[ \text{Rel. of diff.} = \frac{\text{Av. rel. coef. of 2 tests} - \text{Intercorrelation between 2 tests}}{1 - \text{Intercorrelation}} \]


Let us assume that the intercorrelation between the arithmetic and 
history tests is .60. We can look in the appropriate row of Table 3.8 and 
find, in the column headed .85 (the average reliability coefficient of the 
two tests), that the reliability of the difference scores will be only .62. 
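
The table entry can be reproduced from the formula given beneath Table 3.8. This is a minimal sketch; the function name is an assumption, and the inputs are the values used in the example above.

```python
def reliability_of_difference(avg_reliability: float, intercorrelation: float) -> float:
    """Reliability of difference scores from the average reliability and intercorrelation."""
    return (avg_reliability - intercorrelation) / (1.0 - intercorrelation)


if __name__ == "__main__":
    # Arithmetic and history tests: average reliability .85, intercorrelation .60.
    print(f"{reliability_of_difference(0.85, 0.60):.3f}")
```

The computed value is .625, which appears in the table as .62.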

If we wish to compute the standard error for the difference scores (for 
comparisons between the arithmetic and history tests, each of which has 
an SE of 4), we can use the following approximation formula. 


\[ SE_{diff} = \sqrt{SE_1^2 + SE_2^2} = \sqrt{4^2 + 4^2} = \sqrt{32} = 5.7 \]


With an $SE_{diff}$ of approximately 6 points, the chances would be two out of three that a
student's difference score was within 6 points of his true difference score. 
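
The approximation formula for the standard error of a difference can be checked numerically. The function name below is an assumption; the two standard errors of 4 come from the example in the text.

```python
import math


def se_of_difference(se_1: float, se_2: float) -> float:
    """Approximate standard error of the difference between two scores."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2)


if __name__ == "__main__":
    # Arithmetic and history tests, each with a standard error of 4.
    print(f"SE of difference = {se_of_difference(4, 4):.1f}")
```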


FACTORS AFFECTING THE SIZE OF RELIABILITY COEFFICIENTS 


Obviously, we would like test scores to be as reliable as possible. How- 
ever, it would be a mistake to assume that we should simply list the relia- 
bility coefficients for several published tests, among which we wish to choose, 
and select the test with the highest coefficient as best for our purposes. 


Methods of Estimating Reliability 


Comparing the reliability coefficients for different tests requires careful 
attention to factors that affect the size of such coefficients. We have already 
considered one of the major factors affecting their size, that is, the method 
used in obtaining data on reliability. The method that takes into account 
both stability and equivalence will tend to give lower coefficients than the 
other methods because all major types of error variance are included. The
Kuder-Richardson method will tend to yield lower coefficients than the
split-halves method because the former reflects test homogeneity, as
reflected in all interitem relationships, rather than merely the consistency
of scores on two halves of a subdivided test.

With any method involving two testing occasions, the longer the interval
of time between two test administrations, the lower the coefficient will
tend to be.


18 If the correlation between two tests were positive and perfect, all the difference
scores would be zero; if the correlation were -1.00, the difference scores would be
of maximum size. In between these two extremes, the larger the positive correlation
between the tests, the smaller the differences will tend to be.


Heterogeneity of Group 


Another factor affecting size of reliability coefficients is more difficult 
to understand, that is, the "range of ability" or dispersion of scores within
the group on which the reliability coefficient is computed. For example, the
reliability coefficients for intelligence tests, computed on elementary or
high school groups, tend to be higher than those computed on college 
groups, which show much less dispersion with respect to IQ. 

We have already cited two different reliability coefficients for the state 
history test. The coefficient of .85 was obtained by the Kuder-Richardson 
method, the one of .96 by the split-halves method.¹⁹ This large difference
cannot be explained entirely in terms of method. Another reason is that the 
reliability coefficient computed by the split-halves method was based on 
students from all schools in the district, a more heterogeneous population 
than the Central High School group, on which the Kuder-Richardson 
coefficient was obtained. The SD for “all schools” was 12, rather than 10. 
If the split-halves reliability coefficient had been computed on a sample 
with the smaller SD of 10, the coefficient would have been reduced²⁰ from
.93 to .85. Hence, we see that some of the difference is due to method and
some to differences in the homogeneity of the group studied. 

Ideally, reliability coefficients should be computed on groups which
have about the same SD as the group for which the test user wishes to
make interindividual comparisons. The user of an achievement test usually
wishes to compare individuals within a single grade in a school or a school
district, rather than individuals within a larger, more heterogeneous
population. Therefore, an increasing number of test manuals are presenting
reliability coefficients by grade level for each of several schools, or each
of several communities.

Table 3.9 illustrates good practice in that reliability coefficients were
computed for single communities; the range of coefficients is shown, rather
than coefficients based on the more heterogeneous population of all four
school systems combined.

This table, however, like those in many test manuals, provides
inadequate data concerning the samples of students tested. The Technical
Recommendations specify that reliability samples should be described.²¹


19 The coefficient of .93, computed in Figure A.2, was corrected by the Spearman-
Brown method (Table A.2) to .96.

20 A table for correcting values of r for “restriction in range” is given in most
standard textbooks in statistics, for example, Quinn McNemar, Psychological
Statistics, 3d ed. (New York: John Wiley and Sons, Inc., 1962), p. 144.

21 “Technical Recommendations for Psychological Tests and Diagnostic
Techniques," Supplement to the Psychological Bulletin, vol. 51 (March 1954), p. 230.



Table 3.9
Reliability Coefficients and Standard Errors of Measurement for
Subtests of Metropolitan Achievement Tests, Intermediate Level

                                        RELIABILITY COEFFICIENTᵃ    STANDARD ERRORᵇ
TEST                                    RANGE       MEDIAN          RANGE       MEDIAN
 1. Word Knowledge                      .88-.95      .94            3.0-3.4      3.1
 2. Reading                             .89-.92      .90            2.5-2.8      2.6
 3. Spelling                            .91-.96      .92            2.6-3.5      3.0
 4. Language
      Part A: Usage                     .78-.84      .81            1.9-2.5      2.2
      Part B: Parts of Speech           .64-.77      .72            1.3-1.3      1.3
      Part C: Punctuation and
        Capitalization                  .80-.88      .83            2.1-2.4      2.2
      Total (Parts A-C)                 .87-.91      .89            3.3-3.5      3.3
 5. Language Study Skills               .76-.85      .79            2.0-2.4      2.2
 6. Arithmetic Computation              .82-.94      .88            2.1-2.7      2.4
 7. Arithmetic Problem Solving
      and Concepts                      .90-.95      .92            2.2-2.5      2.4
 8. Social Studies Information          .86-.87      .87            3.3-3.5      3.4
 9. Social Studies Study Skills         .64-.77      .73            2.2-2.5      2.2
10. Science                             .87-.90      .89            2.8-3.3      3.0

Source: Walter N. Durost, Manual for Interpreting Metropolitan Achievement Tests (New York:
Harcourt, Brace, & World, Inc., 1962), p. 46.

ᵃ Values reported are ranges and medians of four independent estimates of corrected split-
half coefficients. Each estimate is based on a random sample (N = 100) of grade 6.1 pupils.
Each sample was chosen from a single school system, with four school systems being used at
each grade level to typify high, low, and average performance on the test.

ᵇ Standard error of measurement in terms of raw score.


Standards for Reliability Coefficients 


No arbitrary standards can be established regarding satisfactory levels 
for reliability coefficients. Obviously, the highest requirements must be set 
when we are required to make major decisions about individuals on the 
basis of a single test; but fortunately such situations are rare. Another 
situation demanding high test reliability is one where we want to interpret 
intraindividual differences, that is, differences between individual scores 
on tests that measure various components of scholastic aptitude, musical
ability, and the like. The lowest demands on reliability would be made
when we are comparing the average scores for large groups. For example,
in the problem described in Chapter 1, in which we are comparing the
average spelling achievement of several classes using teaching machines
with the average for several classes using standard methods, a less reliable
test would be acceptable.

On the basis of certain assumptions concerning the accuracy with 
which a test should discriminate between groups and between individuals, 
Kelley²³ derived the following minimum reliability coefficients for tests
used for different purposes. 


PURPOSE FOR WHICH SCORES ARE USED                                  MINIMUM RELIABILITY
                                                                   COEFFICIENT

To evaluate level of group accomplishment                                .50
To evaluate differences in level of group accomplishment
  in two or more performances                                            .90
To evaluate level of individual accomplishment                           .94
To evaluate differences in level of individual accomplishment
  in two or more performances                                            .98


These values have been widely quoted. However, there has been a grow- 
ing recognition that one can often use to advantage tests with reliability 
coefficients below these minimums. 

In practice, one attempts to find a test that equals or surpasses in relia- 
bility the values typically attained in that field of measurement. For 
example, we should not reject a test of listening comprehension with a 
reliability coefficient below .94 if our only substitute for such testing is to 
rely on measures, such as observations and rating, which are far less 
reliable. 

Application of arbitrary standards in the selection of tests can result in 
unwise decisions. For example, we might select a test with a reliability 
coefficient of .95 (based on the split-halves method and a combined student 
population from several grade levels) and reject another test with a relia- 
bility coefficient of .92 (obtained by administering alternate forms at a 
two-weeks interval to students of a single grade level in five different communities
and taking the median coefficient). The reliability coefficient for
the first test is spuriously high because of the heterogeneity of the group 
on which the reliability coefficient was computed. The latter test would 
naturally have a lower reliability coefficient because the equivalence- 
stability coefficient measures all sources of error variance, and the coefficients
were computed for several relatively homogeneous groups. The
reliability coefficient of the first test might prove to be considerably lower
than .92, if its reliability coefficient were computed under the same exacting
conditions as were used for the second test.

22 Kelley assumed that a test should permit discrimination of differences in an
attribute as small as one-fourth the standard deviation for a grade-level group, with
chances of five to one of being correct about the direction of the difference.
23 T. L. Kelley, Interpretation of Educational Measurements (New York: Harcourt,
Brace, & World, Inc., 1927).


IMPROVING THE RELIABILITY OF TEST SCORES 


We have already considered some of the factors that affect the size of 
reliability coefficients—factors that must be taken into account when we 
are comparing reliability coefficients computed by different methods with 
groups varying in heterogeneity. We have not yet considered the ways in 
which the reliability of test scores can be increased. 


Increasing Length of Test or Size of Sample 


Tests always constitute limited samples of behavior. Hence, a basic 
approach to improving reliability of scores is to increase the size of the 
sample. If we check Table A.2, we see that if a test with a reliability co- 
efficient of only .82 is doubled in length, the estimated reliability coefficient 
of the longer test would be .90; if it were tripled in length, the estimated 
reliability coefficient would be .93. We can see that reliability increases 
with size of sample, but that the increase is rather slow. If we are con- 
structing a test, we have to ask (1) whether we can double or triple the 
number of items and still maintain item quality, and (2) whether the in- 
crease in reliability justifies the additional testing time. If we are just 
interested in seeing how well the group as a whole is doing, a shorter, less
reliable test would be adequate.
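
The estimates quoted above can be checked with a short calculation. The sketch below assumes that Table A.2 rests on the Spearman-Brown formula, which is the usual basis for such tables; the table itself is not reproduced in this section, so that link is an assumption. The function name is ours and is used only for illustration.

```python
def lengthened_reliability(reliability: float, factor: float) -> float:
    """Spearman-Brown estimate of reliability when a test is made `factor` times as long."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return factor * reliability / (1 + (factor - 1) * reliability)


# A test with a reliability coefficient of .82, kept as is, doubled, and tripled.
for factor in (1, 2, 3):
    estimate = lengthened_reliability(0.82, factor)
    print(f"{factor} x original length: estimated reliability {estimate:.2f}")
```

Doubling the test raises the estimate by about .08, while the third unit of length adds only about .03 more, which is the slow increase noted in the text.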

One application of this principle is a clear-cut one because the quality
of the "additional items" would be the same. That is, if we find that one
published test has the desired emphasis with respect to subject content 
but has a reliability coefficient of only .82, we can administer both forms 
of this test and compute average student scores on the two forms. Thus, we 
would have a longer test (form A plus form B) that would have a relia- 
bility of .90. Using these combined forms to evaluate student achievement 
in the local educational program would undoubtedly be better than using 
another test with a reliability coefficient of .90 but with subject content that 
was not nearly so appropriate to the local pattern of emphasis within a subject
field.


Increasing Objectivity in Scoring 


Another major factor affecting reliability of measurement is subjectivity 
of judgment. We tend to get low reliability coefficients for rating scales,
essay tests, ratings of students’ products in shop and homemaking, and
the like. When a test is objectively scored, this objectivity improves the
reliability of measurement.

The objectivity of a measurement depends on the degree to which personal
subjective judgment has been minimized in the scoring process. In a
multiple-choice test that can be scored by machine, or by a clear-cut 
scoring key for right-and-wrong answers, the ideal in objectivity is achieved. 
If some judgment is involved in scoring, but the manual gives fairly precise
rules that increase scorer agreement, scoring still remains fairly objective.
For example, in the scoring of our 25-word spelling test, scorer
agreement would be increased and test scores would be more consistent
and reliable if we specified that:


1. A student should not be penalized for failing to dot an i or cross a t.

2. Since spelling, rather than handwriting, is being measured, distinction between
a's and o's (and other easily confused letters) should be made on the
basis of noting the student's usual ways of writing these letters, as shown
in his writing of “personal information” items at the top of his test paper.

3. Any clear correction in the spelling of a word should be credited. 


These rules, or similar ones, would increase interscorer agreement and
make the test scores more objective and reliable. In tests that require
judgment in scoring, many examples should be given of responses that 
should or should not be credited; for example, in the vocabulary test of 
the Stanford-Binet, types of definitions for each word that would be given 
“full-credit,” “half-credit,” or “no credit,” have been included. 


Maximizing Consistency in Test Administration 


All estimates of reliability, 
reflect the error variance that i 
ministration, such as deviations 
tation of test requirements. In о 


Make sure that your score on this spelling test is not lowered by careless
handwriting. Look back over your work; dot your i's and cross your t's. Rewrite
any word in which the letters are carelessly written so that an a might be confused
with an o, an m for an n, or an e for an i.


Selecting Tests of an Appropriate Level of Difficulty for Students 


When tests are administered to very heterogeneous groups of students, 
as is so frequently done in city-wide testing programs, some of the stu- 
dents will find the test so difficult that they will do a great deal of guessing; 
hence the consistency of these students’ scores from test to retest, or from 
one form to another, will be low. 

For very able students or groups, the test selected for city-wide use 
may be ineffective in differentiating among them. Students in able groups 
get most of the test items right; for each able student, his score might be 
interpreted as a constant (the number of items of low and average dif- 
ficulty) plus the individual score he gets on the small proportion of more 
difficult items. It is as if the test had been shortened to the few difficult 
items that differentiate among able students, that is, which some able 
students answer correctly while others do not. If the reliability coefficient
of such a test were computed on classes of able students, it would be low, 
reflecting the brevity of the test for these students and the inconsistencies 
in their ranks from one "short test" to another. 

A desirable approach to this difficult problem is to investigate the con- 
sistency of measurement at different score levels. Then the standard errors 
of measurement for different score levels can be reflected in the norms 
table, as the Educational Testing Service has done in their use of per- 
centile bands for the STEP tests. However, if we wish to reduce the error 
of measurement, rather than just take it into account in interpretation, it 
may be best to replace the single city-wide test by two or more tests that 
are geared to the different levels of achievement of students.

The Educational Testing Service has planned its STEP test series so 
that STEP tests of two or more levels can be administered to a group at 
the same time. For example, levels 3 and 4 (designed respectively for
grades 7-9 and grades 4-6) can be administered together in a single 
classroom, with the more able sixth-grade students taking level 3, while 
other students in the same classroom take level 4 at the same time. The 
directions and time limits are identical, and the test materials do not 
specify the grade levels for which they are designed.

Table 3.10 illustrates excellent procedures in computing the standard 
errors of measurement at different score levels. A large number of students 
at each grade level were administered both form A and form B of the 
Lorge-Thorndike Intelligence Tests. The average raw score on forms A
and B was computed for each student. Then subgroups were formed of 
students whose average scores fell at each raw-score level (for example, 
those with average raw scores of approximately 15, 20, and the like). 
Then for the students in each raw-score subgroup, differences between 
form A and form B scores were tallied and the SD's computed. These
SD's are the standard errors of measurement for each score level. If we
examine the fifth column in Table 3.10 (for 3d grade pupils, verbal
battery), we see that the standard error is approximately twice as large for 
pupils obtaining raw scores of 20 and below, than for those scoring in the 
middle range. More reliable estimates of performance on this type of 
intelligence-test items would probably be obtained by administering the 
next lower level of the test to these low-scoring pupils. 
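
The tallying procedure just described can be sketched in a few lines. The example follows the steps as the text states them (average the two forms, group students by raw-score level, compute the SD of the form A minus form B differences within each group). The scores are hypothetical, and the choice of the population SD and of a five-point spacing between levels are our assumptions, since the manual's exact computing rules are not given here.

```python
from collections import defaultdict
from math import floor
from statistics import pstdev


def sd_of_form_differences(pairs, step=5):
    """Group students by average raw score and return the SD of A-B differences per level.

    pairs: iterable of (form_a_score, form_b_score), one tuple per student.
    step:  spacing of the raw-score levels (15, 20, 25, ... when step is 5).
    """
    differences = defaultdict(list)
    for form_a, form_b in pairs:
        average = (form_a + form_b) / 2
        # Assign the student to the nearest raw-score level (halves round up).
        level = step * floor(average / step + 0.5)
        differences[level].append(form_a - form_b)
    # A subgroup needs at least two students before a spread can be computed.
    return {
        level: pstdev(diffs)
        for level, diffs in sorted(differences.items())
        if len(diffs) >= 2
    }


# Hypothetical scores for eight students (not taken from the technical manual).
scores = [(14, 16), (17, 13), (12, 18), (16, 15),
          (39, 41), (41, 40), (40, 39), (42, 40)]
for level, sd in sd_of_form_differences(scores).items():
    print(f"raw-score level {level}: SD of differences = {sd:.2f}")
```

In these made-up data the two forms disagree more for the low-scoring subgroup than for the middle subgroup, which is the pattern the text points out for third-grade pupils on the verbal battery.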


Table 3.10
Standard Error of Measurement of the Lorge-Thorndike Intelligence Tests
(Grades 3-5) at Selected Raw Score Levels

                          STANDARD ERRORS OF MEASUREMENT (IN IQ POINTS)

AVERAGE                   Nonverbal Battery         Verbal Battery
RAW SCORE      GRADES:    3       4       5         3       4       5

15                        8.7     8.1               6.5     6.6
20                        Re      7.6     5.9       5.8     6.0     6.0
25                        7.0     7.2     5.4       5.1     5.4     5.6
30                        6.2     6.9     5.2       4.5     4.9     5.2
35                        5.8     6.6     5.6       3.9     4.5     5.0
40                        5.6     6.5     6.2       3.5     4.2     5.0
45                        5.5     6.7     6.6       3.2     4.1     5.4
50                        5.7     7.0     6.7       3.0     4.2     5.2
55                        6.0     7.4     6.3       3.3     4.4     5
60                        6.3     7.8     5.8       4.3     4.5     4.8
65                        6.9     8.2     5.0               4.6     4.5
70                        7.8     8.6     5.5               4.7     4.0
75                                        5.9

Weighted average
standard error            6.2     7.1     6.1       4.4     4.6     5.1
Reliability coefficients
(equivalent forms method) .85     .80     .85       .92     .92     .90
Number of cases           2659    1419    834       2659    1419    834


Source: Adapted by permission of the publisher from Irving Lorge and Robert L. Thorndike,
Technical Manual, rev. ed. (Boston: Houghton Mifflin Company, 1962), p. 11.


SUMMARY STATEMENT 


A person's test score summarizes data on his performance on a sampling of
tasks or test items. The concept of reliability is concerned with the consistency
of measurement, or the extent to which an individual's scores vary from one
sample to another of the same type of behavior.




If the same test is readministered on two different occasions, we obtain data 
on variance in scores due to temporal variations in the examinees. If two forms 
of a test are administered on the same occasion, we obtain data on variance in 
scores due to specificity of the samplings of test items. 

The various sources of inconsistency in examinee behavior from one testing 
to another are summarized in Table 3.1; while Table 3.4 clarifies the extent 
to which each of these sources of variance is taken into account by the different 
approaches used in the estimation of test reliability. 

Tables for approximating Kuder-Richardson reliability coefficients and stand- 
ard errors were presented in order to enable students to utilize and interpret 
these measures without necessarily developing proficiency in computation. The 
many factors involved in making comparisons between reliability coefficients 
presented in test manuals were considered, namely (1) the method used in esti- 
mating reliability and (2) the ability range of the groups studied. 

Although reliability coefficients are most useful in assessing the comparative 
reliability of different tests, the standard error is more valuable in the interpre- 
tation of test scores for individuals. 

The reliability of differences between pairs of scores is much less than the 
reliability of either score. If two tests measure closely related abilities, their 
reliability coefficients must meet high standards if one is to interpret difference 
scores with a reasonable degree of confidence. For examples of this relation- 
ship, the reader is referred to Table 3.8. 

The reliability of a test, or the consistency of student scores from one test 
sample to another, depends largely on the length of the test, the homogeneity of
the universe sampled, and the objectivity of test scoring. 


SELECTED REFERENCES 


CRONBACH, LEE J., AND GOLDINE C. GLESER, Psychological Tests and Personnel
Decisions. Urbana, Ill.: University of Illinois Press, 1957.

DIEDERICH, PAUL B., Short-cut Statistics for Teacher-Made Tests. Evaluation and
Advisory Service Series No. 5. Princeton, N.J.: Educational Testing Service,
1960. Available on request.

LORD, FREDERIC M., “Tests of the Same Length Do Have the Same Standard
Error of Measurement,” Educational and Psychological Measurement, vol.
19 (Summer 1959), pp. 233-239.

LORD, FREDERIC M., “The Utilization of Unreliable Difference Scores,” Journal of
Educational Psychology, vol. 49 (June 1958), pp. 150-152.

SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness, rev. ed.
New York: Harper & Row, Publishers, Inc., 1962, Chapter 3.

THORNDIKE, ROBERT L., “Reliability,” in E. F. Lindquist, ed., Educational Measurement.
Washington, D.C.: American Council on Education, 1951, pp.
560-620.

WESMAN, ALEXANDER G., “Better Than Chance,” Test Service Bulletin No. 45.
New York: The Psychological Corporation, 1953. Available on request.

WESMAN, ALEXANDER G., “Reliability and Confidence,” Test Service Bulletin No. 44.
New York: The Psychological Corporation, 1952. Available on request.




DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. What are the major factors that influence the reliability of a test? 

2. When is a test objective? How is objectivity related to reliability? 

3. Discuss the relative merits of each of three methods used in determining
the reliability of a test.

4. Study the manuals for two or more standardized tests, noting the reliability
for all subtests. Summarize and evaluate the evidence presented in terms of
using the test to measure the achievement of groups. Of individuals. How reliable
are the differences between scores on various pairs of subtests?

5. Illustrate how a teacher should use the standard error of measurement in
his interpretation of test scores.

conclusion?

7. Discuss the importance of selecting or developing a test of suitable difficulty
level for a specific group of students.

8. Why is the reliability coefficient obtained by the “odd-even” method considered
to be a “coefficient of equivalence,” even though only one form of the
test is used?


4 Validity 


The term “validity” is used to apply to a test's value as a basis for making 
judgments about examinees. A single test may be used for making several 
types of judgments; its validity may be high for one purpose, moderate 
for another, and low for still another. Hence, we cannot speak of a test as 
having high or low validity without specifying the purpose for which it is 
to be used. 

The term "purpose" is best interpreted as including both the type of 
judgment to be made and the nature of the group involved. A test in busi- 
ness English may be valid for differentiating among high school students 
to make judgments basic to grading. The same test may make little con- 
tribution to the goal of differentiating among applicants with respect to 
predicted success in secretarial positions. Hence, validity is always validity 
for a specific purpose (to aid in making a specific type of judgment concerning
members of a specific group).

Validity has two major aspects: reliability and relevance. For a test to
be valid, that is, to provide a sound basis for judgments, it must measure
“something” with reasonable reliability, and that “something” must either 
be a sample of the behavior we wish to measure or it must have dem- 
onstrated relevance to that behavior. Reliability, or the consistency of 
measurement, was studied in Chapter 3. In this chapter, we will be chiefly 
concerned with relevance, the relationship of scores on the test to the
criterion behavior in which we are really interested.

TESTS AS DIRECT OR INDIRECT MEASURES OF 
CRITERION BEHAVIOR 


Tests as Direct, Unbiased Samplings of Criterion Behavior 


When the content of a test is a random sampling of a defined area of 
content, relevance is not a problem. An example would be the spelling
test designed for use in problem 3 on the use of teaching machines in
spelling instruction. The criterion behavior we wish to measure in this
situation is student performance on the population of 500 words. Since the
sample is a random one, it is unbiased and perfectly relevant. Hence
the validity of this test is determined entirely by its reliability (which
depends chiefly on size of sample and objectivity of scoring). It is rare,
however, that test content constitutes a random sampling of criterion
behavior. Whenever the sampling is not random, human judgment enters 
into the selection of learnings to be tested. 


Tests as Indirect Measures of Criterion Behavior 


In most tests, we do not sample the criterion behavior in which we are 
really interested. For purposes of efficiency in scoring, we introduce 
multiple-choice items when we are really interested in how well students 
can compute, punctuate, spell, and the like. Hence, we must make sta- 
tistical studies to determine how much irrelevant variance has been intro- 
duced by our use of such indirect methods. We can correlate scores on 
our indirect measure with scores on the same test items, presented in the 
direct manner, in which students actually work out the arithmetic problems,
punctuate the sentences, or spell the words.¹ In this situation our
criterion is the score on the test that demands student recall of information,
the actual working of problems, or spelling of words.²
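
A minimal sketch of such a statistical study follows. It computes the product-moment correlation between scores on a multiple-choice form and scores on the same items given in supply form. The scores for the six students are hypothetical, and the function name is ours; a real study would of course use many more examinees.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary in both lists")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores: the indirect (multiple-choice) form and the direct (supply) form.
multiple_choice = [22, 18, 25, 15, 20, 12]
supply_type = [19, 14, 24, 10, 17, 9]
print(f"correlation with the direct measure: {pearson_r(multiple_choice, supply_type):.2f}")
```

The lower this coefficient falls below the reliability of the two forms, the more irrelevant variance the indirect method appears to have introduced.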


1 Hopkins uses the term “coincident validity” or “extrinsic reliability” for the correlation
between scores on a multiple-choice test and scores on the same test, or an
equivalent form of the same test, administered as supply-type items (that is, excluding
the alternative responses for items). Kenneth D. Hopkins, “Validity Concomitants
of Various Scoring Procedures Which Attenuate the Effects of Response Sets and
Chance,” unpublished doctoral thesis, University of Southern California, 1961.

2 This type of correlation coefficient may be perceived as a validity coefficient in
that it takes into account stable but invalid variance associated with the use of selection-type
items. Such variance lowers the relevance of the test scores to the criterion
behavior, in which we are really interested. Such coefficients may, however, be perceived
as reliability coefficients in that they reflect types of error variance. Reliability
coefficients are estimates of the consistency of measurement by methods that are
maximally similar, while validity is concerned with agreement or convergence among
scores that are obtained by quite different methods. On such a continuum of similarity
vs. diversity of methods, the type of coefficient we are considering would, in
the opinion of the author, lie farther from the maximal similarity-of-method end
of the continuum than the subdivided-test method, or even the јела Рат
method, but would still be more of a validity coefficient than a reliability coefficient.
This is a subjective judgment, however, and Hopkins (footnote 1) prefers the term
“extrinsic reliability.” This explanation is intended more to illustrate the fine gradations
between methods of studying error variance than as support for the choice of
one term or the other.

In many situations it is very difficult to assess the validity of a test as
an estimate of ultimate criterion behavior. For example, we may wish to
make judgments about a person's driving ability on the basis of his per- 
formance on the test he takes to obtain his driver's license. The validation 
problem becomes one of relating driving test scores or ratings to the 
ultimate criterion of "success in driving." However, we face a difficult 
problem in that it is almost impossible to obtain reliable measures of the 
ultimate criterion, that is, scores that represent an objective evaluation of 
day-by-day criterion behavior as a driver. 


Tests as Predictors of Future Criterion Behavior 


Let us consider another practical problem, that of making predictions 
about students’ future success in an activity, for example, clerical work. 
If we want to evaluate a test as a basis for making such judgments, actual 
success in clerical jobs would be the “ultimate criterion.” We may decide, 
however, to use teachers’ marks in a clerical practice course as an “inter- 
mediate criterion" for judging the validity of a clerical aptitude test. If so, 
we should study the relevance and reliability of this intermediate criterion. 
Teachers’ marks in the clerical practice course may have low relevance 
to the ultimate criterion of success on the job. In studying relevance, we 
would investigate the extent to which the clerical tasks and standards of 
performance in the course are representative of those on the job. Both 
relevance and reliability would be affected by the degree to which teachers 
based their marks on subjective general impressions and unconscious bias, 
rather than on objective data on student performance. 

The test-maker frequently has to check his test against an "intermediate 
criterion" that may not be closely related to the "ultimate criterion" in 
which he is interested. For example, success in a clerical practice course 
may have only a low or moderate relationship with the ultimate criterion 
of “success on the job.” Obtaining suitable measures of criterion behavior, 
to use as standards for validating tests, has been one of the most difficult 
problems in test validation; it has stimulated study and discussion among
outstanding leaders in the field of measurement.³


3 For further treatment of this problem, see Robert Hoppock, ed., “Criteria of
Vocational Success—A symposium,” Occupations, vol. 14 (June 1936), pp. 917-975;
R. M. Bellows, “Procedures for Evaluating Vocational Criteria,” Journal of Applied
Psychology, vol. 25 (October 1941), pp. 499-513; Edward E. Cureton, “Validity,”
in E. F. Lindquist, ed., Educational Measurement (Washington, D.C.: American
Council on Education, 1951), pp. 621-694; Edwin E. Ghiselli and C. W. Brown,
“Analysis of Jobs,” Personnel and Industrial Psychology (New York: McGraw-Hill
Book Company, Inc., 1955), pp. 17-58; D. B. Stuit, “The Effect of the Nature of
the Criterion upon the Validity of Aptitude Tests,” Educational and Psychological
Measurement, vol. 7 (Winter 1947), pp. 671-676; Donald E. Super and John O.
Crites, Appraising Vocational Fitness (New York: Harper & Row, Publishers, Inc.,
1962), pp. 32-41; Robert L. Thorndike, Personnel Selection: Test and Measurement
Techniques (New York: John Wiley and Sons, Inc., 1949), Chapter 5; H. A. Toops,
“The Criterion,” Educational and Psychological Measurement, vol. 4 (Winter 1944),
pp. 271-297.

Fortunately for the test-maker, some intermediate criteria have significance
in their own right. These criteria, in a sense, carry their own
labels of success or failure, such as graduation from high school, retention
of a job, making at least a C average during the freshman year in college.
That is, the ultimate criterion behavior, involving “success on the job,”
can be exhibited only by persons who pass certain intermediate hurdles. A
person will have no opportunity to show his performance on the ultimate
criterion of success in a specific profession unless he first scores sufficiently
high on college admission tests, earns a certain grade-point-average in
college and professional school, and passes some type of a licensing examination.
Hence student performance on any of these intermediate criteria
can serve as a partial basis for validating aptitude tests. Moreover, teachers’
marks and supervisors’ ratings, even though biased, are socially significant
in their effects; and hence the relationship of test scores to these criteria
is worthy of study.

We must not assume, however, that the relationship of intermediate
criteria to ultimate criteria is unimportant. It is to the advantage of both
society and the individuals concerned if we guide into training programs
those persons who will ultimately be successful in the actual jobs or
professions. Success on preliminary training hurdles is necessary but
not sufficient.


TYPES OF JUDGMENTS MADE ON THE
BASIS OF TEST RESULTS


When we construct or select a test, our chief concern is that the test scores
enable us to improve our bases for judgments about the examinees. In
the Technical Recommendations, four types of judgments that test users
desire to make are listed as a basis for clarifying the different types of
validity studies which need to be made. These types of judgments, or
purposes⁴ in testing, are stated as follows:


1. The test user wishes to determine how an individual would perform at
present in a given universe of situations of which the test situation constitutes
a sample.

2. The test user wishes to estimate an individual's present status on some
variable external to the test.

3. The test user wishes to predict an individual's future performance (on the
test or on some external variable).

4. The test user wishes to infer the degree to which the individual possesses
some trait or quality (construct), presumed to be reflected in the test performance.⁵
[Italics added. Items 2 and 3 have been interchanged to correspond with
order of presentation in the text and Tables 4.2 through 4.9.]


4 In this statement of generalized purposes, only the type of judgment is indicated
and no reference is made to the group of individuals about which such judgments
will be made. In evaluating tests for local use, however, both the type of judgment
to be made and the nature of the group tested must be considered.


Each of these four types of judgments involves a different focus of 
concern with respect to test validity. When we are making judgments of 
the first type, we are chiefly concerned with content validity (how well 
our test sample represents the universe of criterion behavior); in the 
second we are interested in concurrent validity (how closely test scores 
are correlated with present criterion behavior); in the third we are con- 
cerned with predictive validity (how well test scores predict future criterion 
behavior); for the fourth purpose we are concerned with construct validity
(how well our test seems to measure the hypothesized trait—as shown by 
the effectiveness of the test in differentiating among groups that are pre- 
sumed to differ with respect to the trait, and also by the relationship of 
test scores to predicted behavior in natural or specially designed situa- 
tions). Each of these types of validity will be discussed and illustrated. 

It is impossible for the authors of a standardized test to provide com- 
pletely adequate data on the validity of their test for all purposes (judg- 
ments and groups) for which test users might conceivably employ the 
test. The authors should, however, provide data that enable the test user 
to judge whether the test is likely to be valid for a specific purpose. Once
he has selected a test on the basis of such a hypothesis, he should collect
local validation data to check on the test’s validity as a basis for making
the type of judgments he wishes to make about the students he wants to
select, classify, or counsel.


CONTENT VALIDITY 


Random Sampling of a Universe as a Basis for the First Type of Judgment


The reader will recognize that when we used the local spelling test we 
wished to make the first type of judgment (listed above) in order to assist 
us in “evaluation of treatments,” that is, evaluation of the efficacy of a
specific teaching-machine program. Our test is a random sampling of the
spelling words studied; hence our criterion is student performance on
the defined universe of 500 words. It is not feasible to use student performance
on a 500-word dictation spelling test as our criterion, but we can
estimate the validity of the shorter test from its reliability coefficient.

5 “Technical Recommendations for Psychological Tests and Diagnostic Techniques,”
Supplement to the Psychological Bulletin, vol. 51 (March 1954), p. 213.

The shorter test is a perfectly relevant sample of criterion behavior
except for the sampling errors involved in using only 25 of the 500 words.
If the reliability coefficient⁶ is .64, we can estimate that the validity coefficient
is .80 (the square root of the reliability coefficient). This coefficient
is an estimate of the correlation between students’ scores on one sampling
of 25 words and their “true scores” (theoretical scores on the entire
universe of words). In other words, when we can exactly define the universe
of criterion behaviors we wish to measure, and can obtain unbiased
measures of student performance on a sample of this universe, the validity 
coefficient of the sample test depends entirely upon its reliability. 
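
The square-root relationship used above is easy to tabulate. The sketch below simply applies it to a few reliability coefficients; the function name is ours, and the estimate applies only under the condition the text states, namely that the test is an unbiased sample of a defined universe.

```python
from math import sqrt


def validity_from_reliability(reliability: float) -> float:
    """Estimated correlation between sample-test scores and true scores on the universe."""
    if not 0 <= reliability <= 1:
        raise ValueError("a reliability coefficient lies between 0 and 1")
    return sqrt(reliability)


# The .64 case from the text, with two other values for comparison.
for reliability in (0.49, 0.64, 0.81):
    print(f"reliability {reliability:.2f} -> estimated validity "
          f"{validity_from_reliability(reliability):.2f}")
```

Because the square root of a fraction is larger than the fraction, the estimated validity coefficient of such a test is always at least as high as its reliability coefficient.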


Achieving Representativeness of Sampling When Random Sampling Is Not Feasible 


When we measured student achievement in history in problem 4, we 
were also concerned with making the first type of judgment; that is, 
we wished to make inferences, from student performance on sample tests, 
concerning their knowledge of the history of their state and nation. 
Defining the universe of knowledge of history, however, is intrinsically 
more difficult than for spelling. In fact, professional judgment concerning 
the relative emphasis on different areas is required. Since all the teachers 
used the same textbook in state history, and the course of study indicated 
roughly the division of time to be allotted each major area, the teachers 
could agree on the percentage distributions listed at the bottom, and in
the left-hand column, of Table 4.1. On the basis of this sampling plan,
they were able to devise a test that fitted the teachers’ specifications
quite closely.


One can see by comparing the “total” column with the percentages at
the left that the desired distribution of items by chronological
period was achieved. With respect to aspects of history, the specifica-
tion of 40 percent of the items on the political and military aspects and
30 percent for each of the other aspects


Validity 109 


Table 4.1
Table of Specifications (or Sampling Plan) for City-Wide Test in
State History (100 Items)

                                          Number of Items on Each Aspect of History

                                          POLITICAL   SOCIAL
                                          AND         AND         ECO-
CHRONOLOGICAL PERIOD (AND DESIRED         MILITARY    CULTURAL    NOMIC
EMPHASIS ON EACH PERIOD)                  ASPECTS     ASPECTS     ASPECTS     TOTAL

Exploration and colonization (10%)            5           2           3         10
Establishment of state
  government (15%)                           11           2           2         15
Development of the new state (15%)            6           4           5         15
Involvement in the War between
  the States and the
  Reconstruction period (10%)                 4           2           4         10
Industrial and cultural
  development (1875-1910) (15%)               3           5           7         15
World War I, the postwar period,
  and the depression years (15%)              4           4           7         15
World War II to the present (20%)             7           6           7         20

Total number of items (by
  aspects of history)                        40          25          35        100

Desired emphasis on each aspect              40%         30%         30%


was only approximately achieved. Note that no attempt was made to 
balance the questions on aspects of history for each chronological period. 
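
A table of specifications of this kind lends itself to a simple consistency check. The sketch below is illustrative only. It enters the item counts for the first six periods of Table 4.1, confirms that each period receives its desired share of the 100 items, derives the counts for the final period (World War II to the present, 20 percent) from the printed column totals, and compares the totals by aspect with the desired 40-30-30 emphasis. The variable names are our own.

```python
# Items per aspect (political/military, social/cultural, economic) for the
# first six chronological periods of Table 4.1, with desired emphasis (%).
rows = [
    ("Exploration and colonization", 10, (5, 2, 3)),
    ("Establishment of state government", 15, (11, 2, 2)),
    ("Development of the new state", 15, (6, 4, 5)),
    ("War between the States and Reconstruction", 10, (4, 2, 4)),
    ("Industrial and cultural development", 15, (3, 5, 7)),
    ("World War I, postwar period, depression years", 15, (4, 4, 7)),
]
column_totals = (40, 25, 35)      # printed totals by aspect of history
desired_by_aspect = (40, 30, 30)  # desired emphasis by aspect (%)

# On a 100-item test, each period's items should equal its desired percentage.
for name, desired, counts in rows:
    assert sum(counts) == desired, name

# Cross-check: the final period's counts are whatever the column totals leave over.
remaining = tuple(
    total - sum(counts[i] for _, _, counts in rows)
    for i, total in enumerate(column_totals)
)
print("World War II to the present:", remaining, "total", sum(remaining))

# Departure of the achieved totals from the desired emphasis on each aspect.
deviation = tuple(t - d for t, d in zip(column_totals, desired_by_aspect))
print("Deviation by aspect:", deviation)
```

The check shows the period totals matching the plan exactly, while the social and economic aspects depart from the desired emphasis by five items each in opposite directions.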


Appraising the Content Validity of Standardized Tests for Local Use 


With the test of United States history, teachers were also concerned
about the first type of validity, content validity, that is, the extent to which
the content of different standardized tests represented the universe of
content they wished to sample. In this case, they decided on a two-way
table of specifications involving both type of objective and area of subject
matter.⁸ They decided that only 40-50 percent of the items should be
concerned with memory of knowledge; that 30-40 percent should require
that the student demonstrate his comprehension of what he had learned

8 For an example of such a table of specifications (involving both objectives and
content) the reader is referred to Table 10.1 of Chapter 10.


110 THE EVALUATIVE PROCESS 


through his ability to interpret trends, explain cause-effect relationships,
and the like; and that 10-30 percent of the items should involve such
higher abilities as application, analysis, and evaluation. The terms in italics
refer to five of the six categories in the taxonomy of objectives,⁹ which
will be reviewed in Chapter 11.

On the basis of the pooled judgments of teachers concerning their
desired emphases, the committee developed guidelines for examining a
number of standardized history tests for their content validity for their
purposes. The committee discovered that some test manuals gave adequate
information in their own tables of specifications so that it was compara-
tively easy to judge the relevance of the test to local curricular emphases.
For other tests it was necessary to go through the test, item by item,
attempting to classify the items according to objective and content.

The committee concluded that no test actually fit their local specifica-
tions closely. However, two tests were worthy of further study in that
they had adequate reliability coefficients, fair content validity, and met
other practical criteria regarding ease of administration, scoring, and the
like. One of these tests corresponded quite well with the teachers’ proposed
emphasis with respect to content but not with respect to objectives. This
test included 70 percent items of the “knowledge” type, 30 percent “com-
prehension,” and none representing the other types of objectives. The
second test, although it included far too little content on the history of
the United States during the twentieth century, had a better distribution
with respect to the objectives measured.

The committee decided that, since the community was most interested
in knowledge, they would choose the first test, which approximated their
desired distribution with respect to content. Then they would devise a sup-
plementary test of their own that would include items which tapped the
higher abilities of interpretation, application, and so forth. They realized
that the second test, with its minimal emphasis on recent history, would not
constitute a fair basis for making judgments about the effectiveness of their
instruction in furthering the knowledge objectives of American history.


Constructing and Validating Tests
for the First Purpose¹⁰


Table 4.2 is concerned with the construction and validation of tests
designed to serve the first purpose. This table not only gives further
examples of content validity, but specifies the procedures that should be

9 Benjamin S. Bloom, ed., Taxonomy of Educational Objectives, Handbook I:
Cognitive Domain (New York: David McKay Company, Inc., 1956).
10 As listed in the “Technical Recommendations for Psychological Tests and
Diagnostic Techniques,” op. cit., and quoted on pages 106-107 of this textbook.


Table 4.2
Construction and Validation of Tests to Be Used to Serve the First Purpose: Sampling Present Performanceᵃ

AIM IN TESTING (TYPE OF JUDGMENT TO BE MADE)

To estimate how a person would perform, at present, in a defined “universe” of situations, of which the test constitutes a sample.

EXAMPLES: To estimate how a person drives a car in typical urban traffic conditions; to estimate how accurately a child spells the
words he has studied this school year; to estimate how many of the words usually included in first-grade readers a child
can recognize.

TYPES OF TESTS USED FOR THIS PURPOSE

Standardized tests of achievement, teacher-made achievement tests, work-samples (such as a sample of driving performance), and
the like. All these tests provide a record of individual performance on a sampling of a defined universe of situations.

GENERALIZED PROCEDURES FOR THE DEVELOPMENT OF TESTS FOR THIS PURPOSE

1. Define the universe of content and situations to be sampled.
   a. Restrict the universe to be sampled in terms of
      (1) Types of inferences you wish to make from test scoresᵇ
      (2) Feasibility of obtaining dataᶜ
   b. Restrict the universe to be sampled in terms of aspects of behavior to be studied.ᵈ

2. Sample the universe by procedures that can be clearly described.
   a. If one is sampling a finite, clearly defined universe of items, random sampling can be used.ᵉ
   b. In the construction of most achievement tests, the universe of content and abilities to be sampled cannot be as clearly
      defined. Hence, a basis for selecting a stratified sampling of knowledge, skills, and the like should be decided upon.
      The teacher, or a professional group, should decide on the relative emphasis to be given each area of content and each
      major objective.
      (1) For a teacher-made test, the teacher himself can make these judgments although he may find it desirable to check his
          judgments with those of coworkers in the same subject area.
      (2) For a standardized test, pooling the judgments of professionally trained teachers and supervisors in the subject fieldᶠ
          is most advisable.
   c. After items have been written and tried out with students, the test constructor should select items so as to maintain the
      approximate distribution of items (by area and objective) as was planned in advance when the test “blueprint” was designed.

3. Unless the total test is highly homogeneous in content, the items should be grouped into relatively homogeneous subtests.
   a. Since many achievement test batteries are intended to sample a very broad universe of content, the items of the battery may
      measure a number of abilities that are not highly intercorrelated. For meaningful interpretation of test scores, therefore,
      test items should be grouped into subtests that sample relatively homogeneous components of the universe sampled.ᵍ
   b. For any subtest (group of items for which a score is computed), evidence should be presented about the internal consis-
      tency of that subtest; that is, evidence concerning the reliability of subtest scores is essential if we are to avoid
      unwarranted inferences about the relative achievement of individual students in different areas.ʰ

INFORMATION THAT SHOULD BE PROVIDED IN A TEST MANUAL REGARDING THE CONTENT VALIDITY OF A TEST

The sources and criteria used to ensure that the test sample is representative of performance in the defined universe should be
clearly stated. The manual should indicate:

1. The sources from which items were drawnⁱ

2. The criteria for inclusion or exclusion of items.
   The blueprint or table of specifications for an achievement test (or other sampling of performance) should contain sufficient
   information regarding
   a. The bases for selecting test items
   b. The number of items for each content area or skill.

3. Information should also be presented concerning procedures used to increase the efficiency of measurement so that optimum
   use could be made of available testing time. It is efficient to eliminate from the preliminary edition of a standardized
   achievement test (or semester examination, or any other test used to rank students):
   a. Items that are too easy or too difficult for the group for which the test was designed
   b. Items that make little or no contribution to differentiating between high-achieving and low-achieving students on the
      variable sampled.
   c. Items that are trivial or irrelevant.
   Professional judgment, based on data from a try-out of the test, should be used in making such decisions.

ᵃ The four aims listed on pp. 106-107, and used as the basis of organization of Tables 4.2, 4.3, and [two further tables,
numbers illegible], are rephrasings of the four aims of testing, listed in “Technical Recommendations for Psychological
Tests and Diagnostic Techniques,” Supplement to the Psychological Bulletin, vol. 51 (March 1954), p. 213.

ᵇ For example, if we wish to make judgments about a student’s learning of assigned spelling words at his grade level, rather
than his over-all level of spelling ability, we will sample only the words in the speller for his grade.

ᶜ For example, we may not find it feasible to include mountain-driving behavior in the test sample for applicants for driving
licenses. Hence, we restrict the universe sampled, and recognize that we will not be justified in making inferences about
unmeasured aspects of driving ability.

ᵈ For example, if we are studying a sample of first-graders’ performance in reading, we may decide to note only the number of
words recognized. We will not record information about whether or not the child reads with expression, whether he makes
frequent pauses and regressions, or whether he seems tense.

ᵉ For example, in devising the local spelling test mentioned above, the teachers selected every 20th word from a universe of
500 words.

ᶠ The Educational Testing Service has recognized that the measurement expert should not make decisions concerning the
proportional emphasis to be given to different objectives and different content areas. For each subject matter area of the
STEP tests, national professional organizations of teachers were consulted for recommendations concerning personnel to work
on committees that would draft the blueprint or table of specifications for each test [...]. Much item writing was done by
members of these representative committees. The California Test Bureau and many other publishers have also obtained the
reactions of teachers and supervisors concerning test items to be included in achievement tests.

ᵍ For example, if we include items on handwriting, spelling, mechanics of English, and language usage in one omnibus test of
language, we cannot make as meaningful inferences about students’ achievement as we could if scores were available on fairly
homogeneous subtests.

ʰ One of the Technical Recommendations reads as follows: “If items are regarded as a sample from a universe, a coefficient of
internal consistency should be reported for each [...] score, to demonstrate the extent to which the score is saturated with
common factors. ESSENTIAL.” (Quoted from “Technical Recommendations for Psychological Tests and Diagnostic Techniques,”
Supplement to the Psychological Bulletin, vol. 51 (March 1954), p. [illegible].)

ⁱ For example, the person using the test should know what spellers (or standard spelling lists) were sampled for a spelling
test; what vocabulary lists were used as the basis for selecting words for a measure of students’ vocabulary level, and the
like. If a standardized test has based its items on the content of certain currently used textbooks, these books should be
listed in the manual, with their dates of publication.




followed in developing tests which sample a universe of knowledge or
performance. In the last section of the table are listed the types of information
that should be given in the manual of a published test so that the user
could judge the content validity of the test for his own purposes.


CONCURRENT VALIDITY 


The most typical example of the second purpose in testing (estimating
the individual’s status on some attribute external to the test) arises when
a test is being used as a more economical, convenient substitute for ac-
cepted appraisal procedures. A group intelligence test may be substituted
for an individual test; or a multiple-choice spelling test may be substituted
for the students’ actual spelling of words.

When a shortcut substitute for some more elaborate standard method of
measurement is proposed, the question of the validity of the substitute method
does arise with logical legitimacy. In such a situation the concept of validity is
simple, and the meaning of the term is clear.¹¹


The Construction of a Test Specially Designed for the Second Purpose 

An excellent example of the use of concurrent validity studies in test
construction is the work done by the College Entrance Examination Board
in developing objective tests of composition skills.¹² The Board had long
been concerned about the time involved in judging essays and the subjec-
tivity of judges’ ratings. Hence, their staff has engaged in considerable
research to develop objective tests, designed to correlate highly with the
criterion of student performance in essay writing.

As a basis for obtaining criterion scores, each student in the research
group wrote five different essays [...] judges. The criterion score [...]

11 Robert L. Ebel, “Must All Tests Be Valid?” The American Psychologist, vol. 16
(October 1961), pp. 641-642.

12 Annual Report, 1961-62 (Princeton, N. J.: Educational Testing Service, 1962),
p. 99.




Several types of items were found to have concurrent validity and were 
included in the tests because of their correlation with the composite 
criterion score on essay writing, for example: 


1. Multiple-choice questions requiring the student to choose, from a number of 
alternatives, the best expression for an indicated word or phrase 

2. Multiple-choice questions requiring the student to classify sentences accord- 
ing to whether they (a) contained an error in diction, (b) were verbose or 
redundant, (c) contained cliches or abused metaphors, or (d) contained 
faulty grammar 

3. Exercises requiring the student to select the appropriate line to complete a 
poem and indicate whether each of the other alternatives is (a) inappropriate 
in rhythm or meter, (b) inappropriate in style or tone, or (c) inappropriate 
in meaning 

4. Prose exercises similar to (3) above, except that the rejected alternatives 
are to be classified as (a) inappropriate in meaning, (b) inappropriate in 
tone or diction, or (c) grammatically defective.


Summary of Procedures Involved in Constructing and Validating Tests for 
the Second Purpose 


In developing this test of English composition skills, the authors’ pur- 
pose and procedures paralleled those listed in Table 4.3 on concurrent 
validity. This was not a situation in which test authors could sample a 
defined universe of content and abilities. Instead, they had to (1) devise 
an adequate criterion measure and obtain criterion scores for a large 
number of students, (2) devise a large number of items on the basis of 
hypotheses about test items likely to tap the abilities used in writing essays, 
(3) study examinee performance on each item in relation to criterion data, 
and (4) select items for the revised test that maximized the relationship 
between test scores and the criterion data. 
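
Steps (3) and (4) can be illustrated with a short Python sketch. It is not the procedure actually followed by the test authors; it simply correlates each try-out item (scored 1 or 0) with a criterion score and keeps the items showing the highest correlations. The data, the item names, and the functions `pearson` and `select_items` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # an item everyone passes (or fails) tells us nothing
    return sxy / sqrt(sxx * syy)


def select_items(item_scores, criterion, keep):
    """Return the `keep` items whose scores correlate most highly with the criterion."""
    ranked = sorted(
        item_scores,
        key=lambda item: pearson(item_scores[item], criterion),
        reverse=True,
    )
    return ranked[:keep]


# Hypothetical try-out data: six students, a composite essay rating for each,
# and right (1) / wrong (0) responses to four preliminary items.
criterion = [2, 4, 5, 7, 8, 9]
items = {
    "item_a": [0, 0, 0, 1, 1, 1],  # passed mainly by the better essay writers
    "item_b": [1, 0, 1, 0, 1, 0],  # unrelated to essay ratings
    "item_c": [0, 0, 1, 1, 1, 1],
    "item_d": [1, 1, 1, 0, 0, 0],  # passed mainly by the poorer essay writers
}

print(select_items(items, criterion, keep=2))
```

Because the items are chosen for their fit to this particular group, validity coefficients for the revised test should be computed on a different group of examinees; coefficients from the selection group would be spuriously high.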


Concurrent Validity Data as a Partial Substitute for Data on 
Predictive Validity 


Since concurrent validity coefficients, in which we relate test scores to 
present performance, can be obtained with less expense and delay than 
predictive validity coefficients (involving future performance), many apti- 
tude test manuals present concurrent validity data only. Evidence that a 
test has fairly high concurrent validity does not justify the assumption that 
the test also has predictive validity. In fact, Maurer¹⁴ discovered that

13 Adapted from A Description of the College Board Achievement Tests (Prince-
ton, N. J.: Educational Testing Service, 1962), pp. 19-39.
14 Katherine M. Maurer, Intellectual Status at Maturity as a Criterion for Select-
ing Items in Preschool Tests (Minneapolis, Minn.: University of Minnesota Press,
1946).


Table 4.3
Construction and Validation of Tests Used to Serve the Second Purpose: Indirect Assessment of Criterion Data

AIM IN TESTING (TYPE OF JUDGMENT TO BE MADE)

To estimate a person’s present status with respect to some attribute external to the test; usually to obtain more easily, quickly,
and inexpensively estimates of the examinee’s present status with respect to some attribute that cannot feasibly be measured by a
more direct method.

EXAMPLES: To estimate a person’s mental age on an individual test (for example, the Wechsler Intelligence Scale for Children)
by using a short form of that test or a group test; to estimate a college applicant’s ability in essay writing by a series
of objective and semiobjective exercises presumed to test similar abilities; to estimate a student’s ability to plan
experiments by asking him to rank suggested experiments in their order of value for testing certain hypotheses.

TYPES OF TESTS USED FOR THIS PURPOSE

Short-form individual intelligence tests; group intelligence tests; apparatus tests used to estimate skills, such as driving skills,
under simulated conditions; multiple-choice tests of such skills as spelling, ability to solve chemical equations, and the like;
projective tests or objectively-scored personality inventories, used as a substitute for individual interviews by psychologists (in
the tentative assignment of patients to diagnostic categories, or in the selection of students who should be referred to a counselor
or psychologist).ᵃ

GENERALIZED PROCEDURES FOR THE DEVELOPMENT OF TESTS FOR THIS PURPOSE

1. Obtain the direct or criterion measure of the attribute for a group of students, employees, patients, or other subjects.ᵇ

2. Write a large number of test items on the basis of hypotheses about test items likely to tap the abilities or traits
   represented in the criterion.ᶜ

3. Administer this preliminary edition of the test to persons for whom criterion data are, or will be, available.

4. Study examinee performance on each item of the preliminary test in relation to criterion data. This study could be made by
   a. Computing the correlation between criterion scores and performance on each itemᵈ
   b. Comparing the performance on items for criterion groups, for example, diagnosed neurotics vs. a random sample of people
      in general.

5. Select items for the revised test that maximize the relationship between scores on the test and the criterion data.ᵉ

INFORMATION THAT SHOULD BE PROVIDED IN A TEST MANUAL REGARDING THE CONCURRENT VALIDITY OF A TEST

1. Data concerning the adequacy of the criterion should be given in the test manual.ᶠ

2. Validity coefficients (correlations between test and criterion scores) should be presented for groups that are similar to
   those on which the test is likely to be used. These coefficients should be computed on a different group or groups of
   examinees than the one used in item selection. The validity coefficients for the original group would be spuriously high
   because we would be testing the validity of the instrument on the same group for which we did the item selection.

3. The test manual should present data that help the user to judge the reliability of the estimates he makes of criterion data
   from test scores.ᵍ

4. Unless the concurrent validity of the test is approximately the same throughout the full range of scores, the standard error
   of estimate should be given for different score levels.ʰ

5. Since some criteria, such as ratings, are very unreliable, it is fairly common practice to report concurrent validity
   coefficients that have been corrected for the unreliability of the criterion scores. Such a corrected coefficient measures the
   theoretical relationship between test scores and perfectly reliable criterion scores.ⁱ If a corrected coefficient is given,
   the uncorrected one should also be given so as not to give a misleading impression of the effectiveness with which a short-cut
   measure predicts actual criterion scores. Validity coefficients should not be corrected for the unreliability of the short-cut
   or substitute test.

ᵃ Ordinarily, such measures are used only as preliminary screening devices to select those persons most likely to need the
interviewing time of a highly paid psychologist or psychiatrist.

ᵇ For example, the individually administered WISC can be given to a number of subjects to obtain criterion scores, or students
can be asked to write two or more essays, each of which will be read by several judges, making the judgments independently.
The composite score based on these independent judgments would constitute a criterion score on essay writing.

ᶜ For example, the author of a test on neuroticism would construct items that, according to his hypotheses, are likely to be
answered differently by the neurotic person (as compared with people in general). The author of an objective test of
composition skills might devise items on: correctness of usage, ability to find and correct errors in first-draft material,
ability to rearrange sentences in scrambled paragraphs, and the like.

ᵈ Special techniques are available for minimizing the amount of work required in computing correlation coefficients when one or
both of the variables are dichotomous (for example, a student’s response to an item is either “right” or “wrong,” scored either
+1 or 0). See any standard textbook in statistics, for example, Quinn McNemar, Psychological Statistics, 3d ed. (New York:
John Wiley and Sons, Inc., 1962), Chapter 12.

ᵉ For example, let us assume that we are developing a short form of the WISC for use with brain-damaged children of elementary
school age. If we wanted to use this short form to get a rapid assessment of the child’s mental age, we would select items that
showed the highest correlation with score on the total test. If our purpose were to develop a test that would aid in diagnosing
brain damage, we would select items that gave us maximum differentiation between brain-damaged and normal children. In one
case, our criterion would be score on the total WISC; in the other case our criterion would be membership in the
“brain-damaged” or normal group, on the basis of classification by a neurologist.

ᶠ Although the adequacy of the WISC as a criterion of mental ability has been well established, such a criterion as “composite
grade on essays” has not. Hence, in the latter case, information should be given in the test manual concerning selection of
topics for essays, training of scorers, consistency of grading between scorers, and the like. Rater reliability and rater bias
should be studied; for example, some raters may be grading essays chiefly in terms of mechanics; others may be biased by
neatness; still others may be emphasizing originality of expression. Selection and training of raters would help to reduce
personal bias.

ᵍ For example, if the user is estimating students’ IQ’s on the total test from the scores on a shorter version, he should be
given information concerning the standard error of estimate, so that he can know the “confidence limits” within which
criterion scores can be expected to vary. If criterion groups have been compared (such as neurotic and normal, or
brain-damaged and normal), information should be given concerning the likelihood of misclassification if test data are used as
a short-cut substitute for criterion data.

ʰ For example, a brief form of a mental test might measure with poor reliability at either the low or high extreme of the score
range. If so, the brief form would not constitute a valid basis for estimating full-scale scores for such low-scoring or
high-scoring examinees.

ⁱ The formula for correcting a correlation coefficient for the unreliability of one or both variables is called the “correction
for attenuation.” The formula for this correction is given in any standard statistics textbook; it may be found in McNemar,
op. cit., p. [illegible].




certain items in intelligence tests for young children had concurrent validity 
(in the sense of being related to other evidences of ability at that time); 
while different items proved to have greater predictive validity (in that 
they were associated with high future performance). 

Presentation of concurrent validity data for aptitude tests does not 
relieve the authors and publishers of their responsibility to follow up stu- 
dents, as they go into college and into vocations, and present data on 
predictive validity. Since the practice of substituting concurrent for pre-
dictive validity data is fairly common, the Technical Recommendations
have specified that “Reports of concurrent validity should be so described
that the reader will not regard them as establishing predictive validity.”¹⁵


PREDICTIVE VALIDITY 


Illustrative Uses of Tests for the Third Purpose—To Predict 

Future Performance 

A person responsible for the selection of students likely to succeed in a
given job, college, or curriculum is concerned with test scores as aids in
doing a better job of selection. Counselors also use tests as predictors, but
usually in placement rather than selection decisions. In some school situa-
tions, counselors recommend the placement of students in different ability
groups; in such cases, they will often use data from more than one test.
Counselors also are “predicting” (or helping the student to predict) from
test data whenever scores on aptitude and interest tests are interpreted in
terms of probable chances of succeeding in different colleges or in
different vocations.

Primary grade teachers use reading readiness and intelligence tests as
predictors when they use them as aids in grouping children into rapid-
moving, slow-moving, and average groups for instruction in reading.
Teachers use tests as predictors whenever they utilize test data in making
decisions on the assignment of students to remedial or accelerated groups,
or to the use of instructional materials that are either below or above grade
level. All such judgments involve predictions regarding rate of progress
or chances of success.

Test Scores as Predictors of Future Criterion Performance 


The predictive validity of a test cannot be judged by an examination 
of its content. The basic procedure in studying the predictive validity of a


15 “Technical Recommendations for Psychological Tests and Diagnostic Tech-
niques,” op. cit., pp. 201-238.


test is (1) to administer the test to a group of students or prospective
employees, (2) follow them up and obtain data for each person on one
criterion measure of his later success, and (3) compute a coefficient of
correlation between individuals’ test scores and their criterion scores, which
may represent success in college, in a specific training program, or on the
job. Such a coefficient of correlation may be called a predictive validity
coefficient. We can interpret predictive validity coefficients in terms of the
standard error of estimate¹⁶ of predicted scores. The formula for standard
error of estimate for predicted criterion scores is as follows:

\[ SE_{\text{criterion scores}} = SD_{\text{criterion}} \sqrt{1 - r^{2}} \]

where r is the predictive validity coefficient.

When we use this method of interpretation with typical validity coeffi- 
cients (which usually range in size from .3 to .6), it seems that most pre- 
dictor tests make little contribution to our accuracy of prediction. As an 
illustration, we will compute the standard error of estimate for a fairly 
high validity coefficient of .60 between a scholastic aptitude test and some 
criterion of success in a training program (such as grade-point-average). 

If T-scores were used for both test and criterion, so that SD would be
10, we would obtain the following standard errors of estimate:

If r = .60:

\[ SE_{\text{criterion scores}} = 10\sqrt{1 - (.60)^{2}} = 10\sqrt{1 - .36} = 10\sqrt{.64} = 10(.8) = 8.0 \]

If r = 0:

\[ SE_{\text{criterion scores}} = 10\sqrt{1 - 0} = 10\sqrt{1.0} = 10.0 \]


In other words, if we based our predictions on a test that had a validity 
coefficient of .60, rather than on one with no predictive validity, we would 
have reduced our standard error of estimate from 10 points to 8 points, or 
only 20 percent. 

The formula for index of forecasting efficiency,¹⁷ which is based on this
type of comparison, gives us a value of 20 percent. In other words, we 
reduce our error of prediction by only 20 percent when the validity co- 
efficient is .60. Since we realize that predictive validity coefficients are 
seldom this high, the contribution of test scores to the prediction process 
seems very unpromising. 
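
The two computations above are easily repeated for other validity coefficients. The following Python sketch assumes T-scores (SD of 10) and uses the index of forecasting efficiency in its usual form, 100 times one minus the square root of one minus r squared. The function names are ours.

```python
import math


def standard_error_of_estimate(sd_criterion: float, r: float) -> float:
    """Standard error of estimate for criterion scores predicted from a test."""
    return sd_criterion * math.sqrt(1.0 - r ** 2)


def forecasting_efficiency(r: float) -> float:
    """Percentage reduction in the standard error of estimate, relative to r = 0."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r ** 2))


# T-scores have a standard deviation of 10.
for r in (0.0, 0.3, 0.5, 0.6):
    se = standard_error_of_estimate(10, r)
    eff = forecasting_efficiency(r)
    print(f"r = {r:.2f}   SE of estimate = {se:5.2f}   efficiency = {eff:4.1f}%")
```

Running the loop shows how slowly the index rises over the range of validity coefficients typically obtained, which is the point made in the paragraph above.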

It is important, however, that we realize that this index is based on a 
comparison of predicted and actual scores, while our predictions in edu- 


16 See Chapter 3, page 82.

17 The formula for index of forecasting efficiency is as follows:
\( E = 100\,(1 - \sqrt{1 - r^{2}}) \). See J. P. Guilford, Fundamental
Statistics in Psychology and Education (New York: McGraw-Hill Book
Company, Inc., 1956), pp. 375-378.


cation and personnel work seldom require prediction of precise scores for 
individuals. We are usually satisfied with cruder predictions, such as the 
student's chance of achieving an acceptable rate in typewriting or short- 
hand, or his chances of making at least a C average in college. 


The Use of Expectancy Tables in Interpreting Predictive Validity Data


Table 4.4 helps us to assess more realistically the value of a predictor
test for helping us make the third type of judgment. 


Table 4.4
Improvement in the Prediction of a Student's Chances for Success
When One Bases One's Predictions on a Test That Has a Correlation of
.50 or .60 with the Criterion Score and the Success Ratio is 50 percentᵃ


                              PREDICTED CHANCES OF SUCCESS VS. FAILURE WHEN

                              No information      Predictor tests used that
Student's standing on test    available to aid    correlate with criterion
(percentile and decile)       prediction            .50           .60

90-99th      10                 1 to 1             5 to 1        9 to 1
80-89th       9                 1 to 1             3 to 1        4 to 1
70-79th       8                 1 to 1             2 to 1        2 to 1
60-69th       7                 1 to 1             1 to 1        2 to 1
50-59th       6                 1 to 1             1 to 1        1 to 1
40-49th       5                 1 to 1             1 to 1        1 to 1
30-39th       4                 1 to 1             1 to 1        1 to 2
20-29th       3                 1 to 1             1 to 2        1 to 2
10-19th       2                 1 to 1             1 to 3        1 to 4
 1- 9th       1                 1 to 1             1 to 5        1 to 9


Source: Adapted, with the permission of the publishers, from "Better than
Chance," Test Service Bulletin No. 45 (New York: The Psychological
Corporation, May, 1953), Table III. When a test user obtains a validity
coefficient on the basis of local data of this type, he can construct tables
similar to this one on the basis of (1) the validity coefficient and (2) the
local percentage of successes and failures. Or he could use tables given in
R. W. B. Jackson and A. J. Phillips, "Prediction Efficiencies by Deciles for
Various Degrees of Relationship," Educational Research Series No. 11,
Department of Educational Research (Ontario College of Education, University
of Toronto, 1945).

ᵃ It is assumed that one-half the students succeed on the criterion; for
example, this table could be appropriately used for predicting "success in
the freshman year of college" if one-half of students attain such success
(as defined by a grade-point average of C or better).
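
The source note says a test user can construct a similar table from (1) a validity coefficient and (2) the local percentage of successes. One way to sketch this in Python is shown below. It rests on an assumption that is not stated in the text: predictor and criterion scores are treated as bivariate normal. The published table may have been computed differently, and the function name and the `steps` parameter are hypothetical.

```python
from statistics import NormalDist


def decile_success_probabilities(r, success_ratio=0.5, steps=500):
    """Return P(success) for predictor deciles 1 (lowest) through 10 (highest).

    Assumes standardized, bivariate-normal predictor and criterion scores.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    std = NormalDist()  # standard normal: mean 0, SD 1
    # Criterion z-score that exactly `success_ratio` of all students exceed.
    cutoff = std.inv_cdf(1.0 - success_ratio)
    # SD of criterion scores around the value predicted from the test.
    residual_sd = (1.0 - r * r) ** 0.5
    probabilities = []
    for decile in range(10):
        total = 0.0
        for i in range(steps):
            # Midpoint of a thin slice of percentile rank inside this decile.
            percentile_rank = (decile + (i + 0.5) / steps) / 10.0
            x = std.inv_cdf(percentile_rank)
            # Chance that a student with predictor score x exceeds the cutoff.
            total += 1.0 - std.cdf((cutoff - r * x) / residual_sd)
        probabilities.append(total / steps)
    return probabilities


for r in (0.50, 0.60):
    probs = decile_success_probabilities(r)
    print(f"validity coefficient {r:.2f}")
    for decile in range(10, 0, -1):
        p = probs[decile - 1]
        print(f"  decile {decile:2d}: P(success) = {p:.2f}, odds = {p / (1 - p):.1f} to 1")
```

With a validity coefficient of .00 every decile comes out at an even chance, which corresponds to the "no information" column. Changing `success_ratio` adapts the sketch to a local percentage of successes other than 50 percent.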




Although Table 4.4 has more general application, let us assume that we
are using an admission test of scholastic aptitude to predict a student's
probability of making a C average in college when such an average is
attained by only one-half of the students. If we had no information about
the student's scholastic aptitude, or if our test correlated .00 with the
criterion, the best estimate for students in each decile (on scholastic
aptitude) would be that one-half would "succeed" and one-half "fail" (with
a C average or better being used as our definition of success). In other
words, their chance of success vs. failure would be 1 to 1.

If we use a scholastic aptitude test with a validity coefficient of .50, the
[...] (the range of predictor [...] half fail on the criterion), the [...].
If a student's score on [...], his chances of success are [...]

When we examine the last column of Table 4.4, we see that the seemingly
small increase in the validity coefficient from .50 to .60 is reflected in
considerably greater accuracy of prediction. The range of predictor scores
[...] contribution to prediction, is reduced [...]




Correction of Validity Coefficients for Preselection 


Validity coefficients of .50 to .60 are typical for scholastic aptitude tests
used at the high school level. Coefficients of this size, however, are seldom 
obtained when scholastic aptitude tests are correlated with grade-point 
average in college, especially if the college has high admission standards. 
Many of the applicants who would have obtained low criterion scores are 
eliminated in the selection process. That is, preselection of students results 
in a validity coefficient that is spuriously low. 


Since predictions are to be made for the more heterogeneous group of
applicants, the validity coefficient should be corrected for the homogeneity
of the validation sample. If, for instance, the SD of the freshmen (on the
scholastic aptitude test) is only .6 as large as that for the applicants, the
validity coefficient should be corrected for "restriction of range": a
validity coefficient of only .41 for the selected freshmen would be
corrected to .60. This corrected coefficient is an estimate of the validity
coefficient which would have been obtained if all applicants had been
admitted.¹⁸ This coefficient would be more suitable than the uncorrected one
to use in constructing a probability table like Table 4.4, for use with
unselected applicants.
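
The correction from .41 to .60 can be reproduced with the standard formula for direct selection on the predictor. The text itself refers only to published correction tables, so the formula below is an assumption about how those tables are built, and the function name is hypothetical.

```python
import math


def correct_for_restriction(r_restricted, sd_ratio):
    """Estimate validity in the unrestricted group.

    r_restricted: validity coefficient observed in the selected group.
    sd_ratio: predictor SD of the selected group divided by the predictor SD
              of the full applicant group (for example, 0.6).
    Assumes selection was made directly on the predictor test.
    """
    k = 1.0 / sd_ratio  # how much wider the applicant group's spread is
    rk = r_restricted * k
    return rk / math.sqrt(1.0 - r_restricted ** 2 + rk ** 2)


corrected = correct_for_restriction(0.41, 0.6)
print(f"corrected validity coefficient: {corrected:.2f}")
```

With an observed coefficient of .41 and an SD ratio of .6, the result rounds to .60, which agrees with the example in the text.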


Importance of Obtaining Local Data Regarding Predictive Validity 

When we are using tests to predict, it is very important that we obtain
local validation data. In other words, we should have data that reflect local
conditions, grading practices, and the like for our own group of employees
or students. Test manuals should provide data about the predictive validity
of the test for typical uses and for different types of well-described
validation groups. Thus, on the basis of predictive validity data in test
manuals, the test user can select one or more tests that seem promising for
his local situation. The validity of the test for predicting grades or other
local criterion data, however, is hypothetical until we verify the test's
predictive validity on the basis of local data.

Local expectancy tables are much more meaningful and acceptable than
similar data from a test manual to those students, parents, or employees
to whom test data are being interpreted in terms of probability of success.
Local predictions should be based on the cumulation of considerable data,
however, since the sampling error involved in computing correlation is
substantial unless predictor and criterion scores on at least 100 cases, and
preferably 300 or more cases, are available.


18 A table for correcting values of r for "restriction in range" is given in
most standard textbooks in statistics, for example, Quinn McNemar,
Psychological Statistics, 3d ed. (New York: John Wiley and Sons, Inc., 1962),
p. 144.


Table 4.5
Construction and Validation of Tests Used to Serve the Third Purpose:
Predicting Future Performance


AIM IN TESTING (TYPE OF JUDGMENT TO BE MADE)

To predict a person's future performance on the test or on some external
variable.

EXAMPLES: To predict from arithmetic test scores, students' grades in
algebra; to predict, from eighth-grade scholastic aptitude test scores, how
students will score in a later examination given for college admission; to
predict from reading readiness scores, children's reading achievement (at
the end of second grade).


TYPES OF TESTS USED FOR THIS PURPOSE

Scholastic aptitude tests, vocational aptitude tests ([...] and the like),
special aptitude tests in the arts, interest inventories; many special tests
developed by personnel departments in business, industry, and the armed
services to select and classify personnel; and the like.


EMPHASIZED PROCEDURES FOR THE DEVELOPMENT OF TESTS FOR THIS PURPOSE

1. Study the nature of the ultimate criterion (that is, success on the job,
   in college, or in a specific course) to help in selecting or developing
   one or more intermediate criteria (which will later be used in validating
   the test). If one is trying to predict successful performance on the job,
   a job analysis of the employees' work and a study of characteristics that
   differentiate successful and unsuccessful workers would be advisable.ᵃ A
   careful study of criterion performance also helps in the developing of
   test items with predictive validity.

2. Write a large number of test items on the basis of hypotheses about items
   likely to tap abilities related to success on the ultimate criterion (such
   as effective performance in a chosen career) or a significant intermediate
   criterion.

3. Administer the preliminary test to persons for whom criterion data will
   later be available, for example, applicants for employment, students
   applying for admission to college, students intending to enter the
   engineering profession.

4. As soon as the desired criterion dataᵇ can be made available, obtain such
   data for examinees.
   a. In many cases, one can obtain objective data on an intermediate
      criterion that seems crucial to success on the ultimate criterion.ᶜ
   b. If possible, avoid preselection in the validation group.ᵈ
   c. If supervisors' ratings, teachers' grades, or other subjective
      judgments are used as criterion scores, it is important that one avoid
      criterion contamination. Such contamination inevitably [...] if the
      raters know how different applicants scored on the admission or
      employment tests.

5. Study examinee performance on each item of the predictor test in relation
   to criterion data.ᵉ

6. Select items for the revised test that maximize the relationship between
   predictor test scores and criterion scores.ᶠ
   a. If a predictor test needs to be short (that is, if there is limited
      time for test administration), items should be selected that correlate
      high with criterion scores and low with other items. In this way we get
      maximum predictive value for the shorter time limit.
   b. If a longer test is feasible, items that have high predictive value can
      be grouped into homogeneous, meaningful subtests. Then a specific
      school or a specific employer can, on the basis of local validation
      studies, assign the best weights to different subscores for specific
      purposes.ᵍ


INFORMATION THAT SHOULD BE PROVIDED IN A TEST MANUAL REGARDING THE
PREDICTIVE VALIDITY OF A TEST

1. Since a standardized predictor test will be used to predict criterion data
   in a variety of situations, the test author should make a number of
   validation studies, using more than one criterion for each of several
   validation groups.ʰ The validity coefficients should be obtained on a
   different group of subjects than the one used in item selection.

2. Sufficient data regarding validation groups should be given so that the
   test user can judge whether his group resembles one of the validation
   groups sufficiently well that he can generalize from validity data to his
   own situation. If it appears that the test will be a valuable predictor in
   his local situation, he should test out this hypothesis as soon as
   possible by studying the relationship between predictor scores and
   criterion scores in his local situation.ⁱ

3. The test manual should provide data that enable the user to judge the
   reliability of the predictions he makes from predictor test data to
   criterion scores.ʲ

4. Unless the predictive validity of a test is approximately the same
   throughout the full range of scores, the standard error of estimate should
   be given for different score levels.ᵏ

5. If predictive validity coefficients are corrected for unreliability of the
   criterion, the uncorrected coefficient should also be given. So as not to
   mislead the prospective user concerning the effectiveness of the test in
   predicting actual criterion scores, validity coefficients should not be
   corrected for the unreliability of the predictor test.


ᵃ The "critical incident" approach is [...] used to help in developing
hypotheses concerning "differences that matter." [...] one identifies,
through rating, production record, or some other means, persons who are
considered to be highly successful and highly unsuccessful. Then supervisors
are asked to report incidents that led to an employee's classification as a
superior or inferior foreman, airplane pilot, or other type of employee for
whom the selection test is being devised. A similar study could be made of
students showing a high or low level of attainment of an educational
objective. This technique was developed by Flanagan and described in John C.
Flanagan, "The Critical Incident Technique," Psychological Bulletin, vol. 51
(July 1954), pp. 327-358.

ᵇ Thorndike and Hagen list the qualities desirable in a criterion measure,
in order of importance, as follows: (1) relevance (or the extent to which
score on criterion measure is affected by the same factors that determine
over-all success on total job as worker or student in some special field);
(2) freedom from bias (bias with respect to varying conditions such as
quality of equipment or sales potential of a geographic area, or bias with
respect to variations in generosity of raters or grading standards of
teachers); (3) reliability; and (4) availability (convenience, cost, length
of time one has to wait for criterion data, and the like). Robert L.
Thorndike and Elizabeth Hagen, Measurement and Evaluation in Psychology and
Education (New York: John Wiley and Sons, Inc., 1961), p. 166.

ᶜ For example, grade-point average during the first year of engineering
school constitutes a good intermediate criterion when the ultimate criterion
is success in engineering, for unless one can be admitted to, and succeed in,
the engineering curriculum, one will not have a chance to work in the
profession. Shorthand dictation speed might or might not be a good predictor
of the ultimate criterion of job success as a court reporter. Shorthand speed
scores, although crucial in obtaining employment as a court reporter, might
not correlate highly with success ratings within this highly selected group.
Hence, admission to the profession of court reporting might be a far better
criterion for a shorthand proficiency test than subjective ratings of success
on the job. Amount of money earned might be a good criterion for sales work,
provided that [...] salesmen were assigned to an equally good territory; it
would be a very poor criterion for success in teaching.

ᵈ When a new test is being validated for college admission, employee
selection, or some other selection situation, it is best to admit all persons
in the validation sample to the college or to the job. Only in this way can
one check on how effectively the test identifies persons who are likely to
succeed or to fail. If one cannot avoid preselection because of company or
school policies, one must correct the spuriously low validity coefficient,
computed on a selected group, for "restriction in range."

ᵉ For example, compare the performance on each test item for examinees who
later succeed or fail in engineering school, have high or low production
records on the job, and the like.

ᶠ For example, include in a revised engineering aptitude test only those
items that show the highest correlation with freshman grade-point average in
engineering school. Include in a revised reading readiness test only those
items that show the highest correlation with children's scores on the
second-grade reading achievement test.

ᵍ For example, if one had speed and accuracy scores in a test of typewriting
ability, one would give relatively greater weight to speed in selecting
individuals to type rough-draft copy from dictaphone records and relatively
greater weight to accuracy in selecting employees to type perfect copy for
photographic reproduction.

ʰ For example, the author of a reading readiness test should report
correlations between his test scores and reading achievement scores on two or
more standardized tests for each of several validation groups.

ⁱ Ebel stresses the need for making local validation studies of predictor
tests. "For any user's group, the test may be more or less valid than it was
for the test author's tryout group. [...] possibly the user may even have a
somewhat different purpose for testing than the test author had in mind. His
criterion may be different. Again this means that the test may be more or
less valid than the author reported. Under these conditions, how can a test
author possibly publish fully adequate data on validity? The best he can do
is to report validity under certain clearly specified and carefully
restricted conditions of use. For the majority of possible uses of a test,
validation becomes inevitably a responsibility of the test user." Robert L.
Ebel, "Must All Tests Be Valid?" The American Psychologist, vol. 16 (Nov.
1961), p. 649.

ʲ For example, expectancy tables like Table 4.4 might be included that
indicate the range of possible criterion scores for a certain range in
predictor-test scores. The standard error of estimate should be given for any
suggested prediction formula for estimating future grade-point averages,
production records, or other criterion scores. If predictions are made in
terms of membership in categories, such as delinquent and nondelinquent, data
should be provided that indicate the risk of misclassification.

ᵏ For example, a test in arithmetic computation might have fairly high
predictive validity for success in algebra within the range of low and
average scores; however, increasing increments of accuracy in computation
might not be associated with higher achievement in algebra.
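
To see why at least 100 and preferably 300 cases are recommended for a local validity study, one can compare approximate 95 percent confidence intervals for a correlation at different sample sizes. The sketch below uses the Fisher z transformation, a standard approximation that the text does not itself mention; the function name and the example value of .50 are illustrative assumptions.

```python
import math


def correlation_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r based on n cases.

    Uses the Fisher z transformation; z_crit = 1.96 gives roughly 95 percent.
    """
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


for n in (30, 100, 300):
    low, high = correlation_interval(0.50, n)
    print(f"n = {n:3d}: observed r = .50, interval about {low:.2f} to {high:.2f}")
```

For an observed coefficient of .50 the interval runs from about .34 to .63 with 100 cases and from about .41 to .58 with 300 cases, so the larger sample pins the local validity down considerably more tightly.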


Summary of Procedures Involved in Constructing and Validating Tests 
for the Third Purpose 


Table 4.5 not only gives further examples of the use of tests for predic- 
tion but specifies the procedures that should be followed in developing a 
test designed to predict the future performance of individuals. In the last 
section of the table are listed the types of information that should be given 
in the manual of a published test so that the user can judge the probable 
predictive validity of the test for his own purposes. In making local validity 
studies, the suggestions regarding criterion data outlined in paragraph 4 
(of the section on procedures for test development) should be followed. 


CONSTRUCT VALIDITY 


In our discussion of content validity, we focused our attention on achieve- 
ment tests—tests that were intended to be representative samples of a uni- 
verse of content or skills. In our consideration of concurrent and predictive 
validity, we were concerned chiefly with tests to be used in practical prob- 
lems of selection and prediction—in estimating criterion data from test 
data. Some tests and inventories, however, are not samples of a defined 
universe nor are they designed to predict specific criteria. They presume 
to measure the degree to which individuals possess some trait or
construct.¹⁹ Since tests that presume to measure the same trait frequently show
low intercorrelations, we obviously cannot assume that test names accu-
rately describe the dimension measured.


In appraising a test designed to measure a trait²⁰ or construct, we are


19 According to Cronbach and Meehl, a construct has three essential character- 
istics: (1) it is a postulated attribute assumed to be reflected in test performance, 
(2) it has predictive properties (persons who possess this attribute will in situation 
X act in manner Y with a stated probability), and (3) the meaning of each construct 
is given by the laws in which it occurs, with the result that clarity of knowledge of 
the construct is a positive function of that set of laws, termed a "nomological net." 
Lee J. Cronbach and Paul E. Meehl, "Construct Validity in Psychological Tests," 
Psychological Bulletin, vol. 52 (June 1955), pp. 281-302. 

20 The term "trait" is used in its broadest sense to include abilities, attitudes, per- 
sonality dimensions, and the like. Cureton defines trait as follows: “When the item 
scores of a set of test-item performances correlate substantially and more or less 
uniformly with one another, the sum of the item scores (the summary score or test 
score) has been termed a quasi-measurement of ‘whatever,’ in the reaction-systems 
of the individuals, is evoked in common by the test items as presented in the test 




concerned with all types of evidence that make the interpretation of test
scores more meaningful, that help us to understand what the scores signify.
"Construct validity is an analysis of the meaning of test scores in terms
of psychological concepts."²¹ In their consideration of construct validity,
the Technical Recommendations specify that "the manual should report all
available information which will assist the user in determining what
psychological attributes account for variance in test scores."²² In this
sense of the word, construct validity includes the three other types. In many
test-use situations, we do not care whether an achievement test measures a
single ability or some unknown combination of several abilities, provided
that the "test as a whole" fairly represents the universe we are sampling.
Similarly, we do not care whether a group intelligence test measures an
unknown combination of several abilities if it correlates highly with the
Stanford-Binet or if it predicts a significant criterion, such as grade-point
average.

For certain purposes, however, such as the testing of hypotheses through
research, we would rather measure purer, single-factor traits. However,
since many interrelated factors affect human behavior, we cannot find
correspondingly pure criteria. For example, we might attempt to study the
construct of "social introversion" or shyness. Some persons who rank high
on this "construct" may actually appear unsociable; others may engage in
an average number of social activities but may enjoy them less and
experience more emotional stress in doing so. Hence, scores in any test of
"social introversion" would not be consistently associated with, or highly
correlated with, any single criterion. However, we might be able to make
several hypotheses about ways in which individuals who are socially
retiring would differ from those who are not, with respect to types of
occupations pursued, leadership or followership roles assumed, symptoms of
emotional stress when engaged in social activities, or behavior in
experimental situations that allow opportunities to measure suggestibility,
initiative in a leaderless group discussion, and other factors presumably
related to the construct.

Several criteria, rather than a single criterion, are used. No single
measure has status as the criterion for the trait. When the differences between


situation. This 'whatever' may be termed a 'trait.' The existence of the
'trait' is demonstrated by the fact that the item scores possess some
considerable degree of homogeneity; that is, they measure in some substantial
degree the 'same thing.' We term this 'thing' the trait." Edward E. Cureton,
"Validity," in E. F. Lindquist, ed., Educational Measurement (Washington,
D.C.: American Council on Education, 1951), p. 648.

21 Lee J. Cronbach and Paul E. Meehl, "Construct Validity in Psychological
Tests," Psychological Bulletin, vol. 52 (June 1955), pp. 281-302.

22 "Technical Recommendations for Psychological Tests and Diagnostic
Techniques," op. cit., p. 227.




high-scoring and low-scoring groups are in the predicted direction, the 
findings support both the construct validity of the test and our theory 
concerning the nature of the trait and its correlates in actual behavior. 
When the differences are not in the predicted direction, further evidence
is needed before we can decide whether we should question (1) the
validity of the test, (2) our theories about the nature of the trait and
related behavior, or (3) both.


An Illustration of a Standardized Test Designed to Serve the Fourth Purpose 


Many scholastic aptitude tests are designed to serve the fourth purpose 
(the measurement of individual differences with respect to hypothesized 
traits). In appraising such a test of general intelligence, we are concerned 
with all four types of test validity. 

We peruse the manual for information on the test's content validity: the
author's definition of intelligence and the criteria he has used in including
or excluding items. For example, Lorge and Thorndike state in the manual for
their intelligence tests: "The tests are avowedly measures of abstract
intelligence." In discussing the content validity of the Lorge-Thorndike
Intelligence Tests, the authors indicate that the following characteristics of
their test items elicit behavior that they would describe as intelligent:

1. The tasks deal with abstract and general concepts.
2. In most cases, the tasks require the interpretation and use of symbols.
3. In large part, it is the relationships among concepts and symbols with
   which the examinee must deal.
4. The tasks require the examinee to be flexible in his basis for organizing
   concepts and symbols.
5. Experience must be used in new patterns.
6. Power in working with abstract materials is emphasized, rather than
   speed.²³


The authors clarify that there are many [...] intelligent, which [...]
makes no attempt [...] in situations [...] differently than low-scoring
students [...] situations requiring intelligent behavior; for example,
high-scoring examinees earned better grades and made higher scores on the
Stanford-Binet and on other tests of intelligence. Hence, we see that data
classifiable under other types of validity support the construct validity of
this test as a measure of "abstract intelligence" in that high test scores
are associated with superior performance on several criteria of intelligent
behavior.

23 Irving Lorge and Robert Thorndike, Technical Manual, The Lorge-Thorndike
Intelligence Tests (Boston: Houghton Mifflin Company, 1962), p. 14.

Despite all the data presented on the first three types of validity, the test 
user needs additional information that will help him to judge whether the 
test is measuring the construct of “abstract intelligence.” For example, he 
would like to know the effects of practice or special coaching on test 
scores. If practice has an immediate and sizable effect on test scores, one 
would doubt whether the test measures stable, underlying abilities. The 
Lorge-Thorndike manual reports no studies on coaching; studies of the 
effects of practice, however, revealed that retesting after a week results in 
no average gain for the verbal battery, but an average gain of eight IQ
points in the nonverbal battery.

Further evidence concerning the construct validity of this test comes
from a factor-analysis study of the subtests. The technical manuals of
many tests include tables of factor loadings, which the test user needs to
interpret if he is to judge the value of the test for his purposes. The
problems of construct validity cannot be adequately considered without some
understanding of this important method of making test scores more
meaningful (through systematic study of the pattern of a test's correlations
with other test data and/or nontest data, for the same individuals).

The Use of Factor Analysis in Studies of Construct Validity 


Explanation of the mathematical procedures involved in factor analysis
is clearly beyond the scope of this textbook.²⁴ We will attempt, however,
to help the reader understand how factor loadings are obtained and
interpreted.

In our discussion (in Chapter 3) of the comparatively low reliability of
differences between test scores, we called attention to the fact that many
tests overlap considerably in the abilities they measure. Test interpretation
and test development are aided by a study of the extent to which tests

24 For an introductory discussion of factor theory, and the use of factor
analysis in studying the structure of human abilities, see Lee Cronbach,
Essentials of Psychological Testing (New York: Harper & Row, Publishers,
Inc., 1960), Chapter 9; Jum C. Nunnally, Tests and Measurements: Assessment
and Prediction (New York: McGraw-Hill Book Company, Inc., 1959), Chapter 9.
For a brief discussion of both factor theory and methods, see J. P. Guilford,
Psychometric Methods, 2d ed. (New York: McGraw-Hill Book Company, Inc.,
1954), Chapter 16, or Philip E. Vernon, The Structure of Human Abilities
(New York: John Wiley and Sons, Inc., 1950), pp. 1-24.




measure overlapping traits and by an interpretation of what these over-
lapping traits or factors appear to consist.

As an oversimplified illustration, we might examine data concerning our 
local test of arithmetic and a standardized intelligence test, which we intend 
to use with it, in the counseling of eighth-grade students wishing to take 
algebra. 


Reliability coefficient for arithmetic test .84 
Reliability coefficient for standardized intelligence test .94 
Correlation between the two tests .60 


The reliability coefficient for the arithmetic test indicates that 84 percent
of the variance in scores is "true variance," while 16 percent of the
variance is attributable to error variance. If we square the r of .60 between
the two tests, we find that 36 percent of the variance of scores on either
test can be interpreted as representing overlapping abilities, common to
the two tests.²⁵ This leaves 48 percent (84 minus 36) of the variance of
scores on the arithmetic test that is attributable to specific abilities,
that is, abilities not measured by the intelligence test.

Similarly, the variance of intelligence test scores may be interpreted as 
attributable to 6 percent, error variance; 36 percent, variance due to com- 
mon or overlapping abilities; and 58 percent, abilities specific to the intelli- 
gence test, that is, not measured by the other test. 
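
The arithmetic behind these percentages can be sketched in a few lines of Python. The three shares (error, common, and specific variance) must sum to 100 percent for each test. The function name is hypothetical; the reliabilities and the correlation are the values given above.

```python
def partition_variance(reliability, r_with_other_test):
    """Split a test's score variance into percentage shares.

    error    = 1 - reliability
    common   = squared correlation with the other test
    specific = reliability - common (true variance not shared with the other test)
    """
    shares = {
        "error": 1.0 - reliability,
        "common": r_with_other_test ** 2,
        "specific": reliability - r_with_other_test ** 2,
    }
    return {name: round(100 * share, 1) for name, share in shares.items()}


print("arithmetic test:  ", partition_variance(0.84, 0.60))
print("intelligence test:", partition_variance(0.94, 0.60))
```

The arithmetic test comes out at 16 percent error, 36 percent common, and 48 percent specific variance; the intelligence test at 6, 36, and 58 percent.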

Factor analysis is a systematic procedure for studying the interrelation- 
ships between tests or other measures. Although a factor analysis cannot 
be performed with two tests, and should usually not be attempted unless 
we have ten or more carefully selected tests, this example gives some 
notion of what is meant by overlapping or common abilities. Certainly, in 
this case, we would need correlations with other tests before we could 
possibly interpret or name what "factors" are common to these two tests, 
which were not designed to measure the same abilities. 

A factor analysis study always begins with a complete table of inter- 
correlations among a set of tests, in which each r appears twice (Table 
4.6). Such a table is called a correlation matrix. A crude approach to 
factor analysis can sometimes be made by inspecting such a correlation 
matrix for groups or clusters of variables that show fairly high r's with 
each other. In Table 4.6, we can locate the highest correlation coefficient, 


25 See the explanation in Chapter 3, page 82, that r² reflects the percent of
variance in one variable that is explainable or predictable in terms of
variance in the other; for example, in a situation of the type we are
discussing, a predictive validity coefficient of .60 would indicate that 36
percent of the variance in criterion scores is predictable from variance in
predictor-test scores. If all persons had the same predictor scores, the
variance with respect to criterion scores would be reduced by 36 percent.




which is .74 between tests A and E. We then examine the other r's with
tests A and E to see if any other test has a substantial r with both of them.
Test C obviously qualifies for a place in this cluster since its r's with A
and E are .63 and .57 respectively. Examination of the r's for tests B,
D, and F shows that they do not belong to this first cluster but that they
show high intercorrelations with each other; the three r's are .41, .48, and
.58.
Table 4.6
A Hypothetical Correlation Matrix Showing All Intercorrelations
among Six Testsᵃ


ORIGINAL CORRELATION MATRIX

        A      B      C      D      E      F
A      ( )    .02    .63    .05    .74    .10
B      .02    ( )    .03    .41    .09    .58
C      .63    .03    ( )    .02    .57    .09
D      .05    .41    .02    ( )    .12    .48
E      .74    .09    .57    .12    ( )    .05
F      .10    .58    .09    .48    .05    ( )


THE SAME CORRELATION MATRIX WITH COLUMNS AND ROWS REARRANGED
TO IDENTIFY TWO CLUSTERS OF INTERRELATED VARIABLES

        A      E      C      D      B      F
A      ( )    .74    .63    .05    .02    .10
E      .74    ( )    .57    .12    .09    .05
C      .63    .57    ( )    .02    .03    .09
       ------------------------------------------
D      .05    .12    .02    ( )    .41    .48
B      .02    .09    .03    .41    ( )    .58
F      .10    .05    .09    .48    .58    ( )


ᵃ To find the correlation coefficient between any two tests (for example,
A and F) look down the A column to find an r of .10 in the F row; or look
in the A row and find the same value of r listed in the F column. Each
correlation appears twice in the correlation matrix. The diagonal cells
represent the correlation of each test with itself; with perfect reliability,
each of these r's would be 1.00; however, different values are used at
different steps in the process of factor analysis.


The two clusters are more easily seen in the bottom half of Table 4.6,
where we have rearranged the r's so that the variables in the first cluster
are in the first three columns and rows. Because of a clear-cut pattern of
interrelationships, which is rarely found in actual research studies, we have
identified two clusters of tests that seem to represent common abilities or
factors. If we could then study tests A, E, and C to see what psychological
processes they seem to have in common, as well as the ways in which they
differ from the other tests, with which they show low r's, we might be able
to suggest a name for an ability or trait that the tests in the cluster seem
to be measuring in common, which is absent from (or greatly diminished
in) tests outside the cluster.
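
The rearrangement of Table 4.6 can be imitated mechanically. The following is only a sketch in Python: the grouping rule (put two tests in the same cluster whenever their r is at least .40) and the names `find_clusters`, `print_matrix`, and `threshold` are assumptions made for illustration, not a procedure given in the text, which relies on inspection of the matrix.

```python
# Hypothetical correlations from Table 4.6 (each pair listed once).
TESTS = ["A", "B", "C", "D", "E", "F"]
R = {
    ("A", "B"): .02, ("A", "C"): .63, ("A", "D"): .05, ("A", "E"): .74, ("A", "F"): .10,
    ("B", "C"): .03, ("B", "D"): .41, ("B", "E"): .09, ("B", "F"): .58,
    ("C", "D"): .02, ("C", "E"): .57, ("C", "F"): .09,
    ("D", "E"): .12, ("D", "F"): .48,
    ("E", "F"): .05,
}

def r(x, y):
    """Look up the correlation between two different tests (order does not matter)."""
    return R[(x, y)] if (x, y) in R else R[(y, x)]

def find_clusters(tests, threshold=0.40):
    """Assumed rule: a test joins every cluster containing a test it correlates with at or above the threshold."""
    clusters = []
    for t in tests:
        linked = [c for c in clusters if any(r(t, u) >= threshold for u in c)]
        merged = [t] + [u for c in linked for u in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return [sorted(c) for c in clusters]

def print_matrix(order):
    """Print the correlation matrix with rows and columns in the given order."""
    print("   " + "".join(f"{t:>6}" for t in order))
    for row in order:
        cells = "".join("   ( )" if row == col else f"{r(row, col):6.2f}" for col in order)
        print(f"{row:<3}{cells}")

clusters = find_clusters(TESTS)
order = [t for c in clusters for t in c]
print(clusters)
print_matrix(order)
```

With data as clear-cut as these, any reasonable threshold separates the two clusters; with actual research data the result would depend heavily on the threshold chosen, which is one reason inspection of this kind is uncertain.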

However, clusters of tests are rarely as clear as in the hypothetical set 
of r's in Table 4.6. Inspecting a large table of correlations for possible 
clusters is difficult and uncertain. The mathematical procedures of factor 
analysis surpass those of cluster analysis in several ways. Factor analysis 
involves less subjectivity, achieves more meaningful solutions, and gives 
greater assistance in clarifying what each test measures.26

Every factor analysis study ends with a factor matrix, such as the one
shown in Table 4.7 for the SRA Achievement Series, which shows the
factor loading27 or weight of each of the factors in each of the tests. Since
a factor loading represents the correlation of the factor with the test, we
can square the factor loading to obtain the proportion of variance in the
test attributable to each factor. For example, in Table 4.7, each factor
loading for the grammatical usage test can be squared and the variance
in scores on this test interpreted as follows: Squaring the factor loading of
.72 gives .52, or 52 percent of the variance attributable to factor I;
similarly, squaring the factor loading of .41 gives 17 percent of the variance
attributable to factor III; negligible amounts are attributable to the other
factors. The factor loading of a test on the chief factor it was designed to
measure is called its factorial validity.
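
The arithmetic of squaring factor loadings can be checked with a short sketch in Python. The loadings are those given for the grammatical usage test in Table 4.7; the function name `variance_shares` is an illustrative assumption, and the final sum (often called the communality) is added here on the understanding that squared loadings may be added only because the factors are orthogonal.

```python
# Rotated orthogonal factor loadings for the Grammatical Usage test (Table 4.7).
loadings = {"I": 0.72, "II": 0.02, "III": 0.41, "IV": -0.10, "V": 0.07}

def variance_shares(loadings):
    """Square each loading to get the proportion of test variance attributable to that factor."""
    return {factor: round(value ** 2, 2) for factor, value in loadings.items()}

shares = variance_shares(loadings)
for factor, share in shares.items():
    print(f"Factor {factor}: {share:.0%} of variance")

# Because the factors are orthogonal, the squared loadings can be summed.
communality = round(sum(value ** 2 for value in loadings.values()), 2)
print(f"Variance attributable to the five factors together: {communality:.0%}")
```

The remainder of the variance in scores on this test is not accounted for by the five factors.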


26 Benjamin Fruchter, Introduction to Factor Analysis (Princeton, N. J.: D. Van
Nostrand Company, Inc., 1954), Chapter 2. Raymond B. Cattell, Factor Analysis
(New York: Harper & Row, Publishers, Inc., 1952), Chapter 2.

27 Through the mathematical procedures of factor analysis, the researcher can
[...] each factor. These estimated correlations [...] of the factor loading
for factor I on a [...] indicates the proportion of the variance in test
scores that is attributable to examinee differences with respect to factor
[...] factor I, the variance in scores on this specific test would be reduced
by that proportion.


Table 4.7
SRA Achievement Series, Battery for Grades 2-4
Factor Loadings for Grade 2 (300 Cases)

                                   Rotated Orthogonal Factor Loadings*
                                     I      II     III     IV      V
Comprehension                       .85    .15   -.03    .05    .14
Vocabulary                          .85    .28    .02   -.03   -.04
Capitalization and Punctuation      .60   -.02    .40    .05    .00
Grammatical Usage                   .72    .02    .41   -.10    .07
Arithmetic Reasoning                .78    .25    .02    .21    .00
Arithmetic Concepts                 .68    .29   -.03    .36    .10
Computation                         .46    .13    .06    .40    .00
Auditory Discrimination             .50    .20    .00    .00    .50
Visual Discrimination               .38    .20    .00    .30    .33
Sight Vocabulary                    .78   -.11   -.08    .10    .48

Source: Adapted with the permission of the publisher from Louis P. Thorpe,
D. Welty Lefever, and Robert A. Naslund, SRA Achievement Series, Technical
Supplement (Chicago: Science Research Associates, Inc., 1957), Table 19, p. 25.

* Factor I—the general achievement factor. “This factor occurs because the
various tests measure, in part, the same cognitive skills. These skills, which
are common to all cognitive-type tasks, are assumed to be almost the same as
skills that are measured by the general factor in intelligence tests.”

Factor II—the symbolic language factor. Each of the tests with significant
loadings in this factor deals with such symbolic processes as abstracting,
interpreting, relating, deducing.

Factor III—is composed of the two language arts tests; hence can be [...]
structural language factor, since it measures the knowledge of rules about
the [...]

Factor IV—is labeled quantitative accuracy and principles. It is made up of
the three arithmetic tests.

Factor V—at the second-grade level consists essentially of the three language
perception tests.


To facilitate the work of factor analysts in the area of aptitude testing,
the Educational Testing Service has developed a kit of reference tests for
cognitive factors. These tests have been selected as having high factorial
validity. The original 1954 kit has recently been revised; the 1963 kit
includes 74 tests of 24 aptitude and achievement factors. Now that computers
can do most of the drudgery of factor analysis, a number of very
comprehensive, well-designed studies are being made. While most of the
early studies involved 10 to 20 variables, it is not uncommon for recent
studies to involve 60 variables, with less work and time being required
than in the pioneer studies.28

As factor analysis studies become more extensive and more carefully
designed, their findings will increasingly converge, confirming the existence
of many factors. Fruchter includes a list of factors that have been verified
in three or more studies.29 We are gradually establishing a reference system
of factors, in terms of which different tests can be described. We must,
however, guard against an interpretation of factors as underlying dimensions
of ability or temperament, which inevitably unfold from genetic potential
and will be found in the same patterns in all cultures. Factors are
useful, meaningful dimensions; they arise from the interaction of genetic
factors with patterns of environmental factors that are interrelated in a
specific cultural setting.


The “factors” of factor analysis really correspond to sets of responses whose
joint occurrences are conspicuous in comparison to those of other possible sets
of responses. . . . When we try to “interpret” a factor, therefore, we should
attempt to consider the source of the frequencies of co-occurrence.
There are many possibilities. Among these are: (1) the learning of the
performance of one response is prerequisite to, or implied in, the performance
of another . . . (2) the learned behavior represented in one response has
transferred to the other response; and (3) because of the accidents of personal
history or because of the common experience of certain numbers of the group of
persons, there was a higher probability that both of any pairs of responses were
learned together than that either would have been learned alone.30


All three of the possibilities listed above would help to account for the
fact that individuals may score high or low in the numerical factor, without
necessarily having a comparable score in the verbal factor of mental
ability.

Recently, computers have also made it possible to perform factor
analysis studies of items, with each item being considered as a variable.
An author can develop or select a large number of items that seem to
measure certain personality traits and determine through factor analysis
the items that should logically be grouped together into subtests. Such
factor analyses of items have been made on currently used personality
inventories, with disconcerting results.31

28 Raymond B. Cattell, Factor Analysis (New York: Harper & Row, Publishers,
Inc., 1952), p. 386.

29 Benjamin Fruchter, Introduction to Factor Analysis (Princeton, N. J.: D. Van
Nostrand Company, Inc., 1954), pp. 197-198, citing J. W. French, ed., Conference
on Factorial Studies of Aptitude and Personality Measures (Princeton, N. J.:
Educational Testing Service, 1952), p. 12.

30 John B. Carroll, “Factors of Verbal Achievement,” Invitational Conference on
Testing Problems (Princeton, N. J.: Educational Testing Service, 1961),
pp. 12-13.


The Concepts of Convergent and Discriminant Validity as Aspects 
of Trait Validity 


Campbell emphasizes that the validity of a proposed trait or construct
(such as “social introversion”) should be carefully studied to see if the
trait is distinguishable from other traits and whether two or more measures
of the trait (by independent methods) tend to agree.

Studies should also be made to determine whether individual differences
in trait scores are largely attributable to response tendencies33 (such
as tendency to agree with generalizations, to admit symptoms, or to answer
questions in such a way as to create a good impression). For example, a
great deal of research work was done with the California F Scale (designed
as a measure of authoritarianism) before its trait validity was adequately
studied. It was unfortunate that many studies were made concerning
the relationship of “authoritarianism” to “rigidity” and other traits before
it was found that a response tendency to accept overgeneralizations was
one of the major factors in the “authoritarianism” scores; in fact, persons
who agreed with the extremely worded, cliche-ridden statements of the
California F Scale tended also to agree with their reversals.34

Each test, rating scale, or other measure is a trait-method unit; and
some of the variance in scores is due to method factors that are unrelated
to the trait itself. For example, Thorndike35 found that ratings of teachers’
voices correlated .63 with ratings of their intelligence. He realized that
this correlation must greatly overestimate the relationship between
intelligence and pleasing quality of voice; that the correlation was largely
due to the “halo effect” of the rating method. In other words, the rater who
perceived a certain teacher as a superior teacher tended to rate this teacher
high in these and other traits.

31 For example, studies of the Minnesota Multiphasic Inventory by Comrey and
Soufi, and of the Edwards Personal Preference Schedule by Levonian et al., have
indicated that the present grouping of items into subscales on these two tests
departs markedly from the grouping that appears to be desirable on the basis of
factor analysis studies. Andrew L. Comrey and A. Soufi, “Further Investigation
of Some Factors Found in MMPI Items,” Educational and Psychological
Measurement, vol. 20 (Winter 1960), pp. 777-786; E. Levonian et al., “A
Statistical Evaluation of the Edwards Personal Preference Schedule,” Journal of
Applied Psychology, vol. 43 (December 1959), pp. 355-359.

32 Donald T. Campbell, “Recommendations for APA Test Standards Regarding
Construct, Trait, or Discriminant Validity,” The American Psychologist, vol. 15
(August 1960), pp. 546-553.

33 A response tendency may be defined as “any tendency causing a person
consistently to give different responses to test items than he would when the
same content is presented in different form.” Lee J. Cronbach, “Response Sets
and Test Validity,” Educational and Psychological Measurement, vol. 10
(Spring 1950), pp. 3-31.

34 H. E. Titus and E. P. Hollander, “The California F Scale in Psychological
Research: 1950-1955,” Psychological Bulletin, vol. 54 (January 1957), pp. 47-64.

In studies of both concurrent and predictive validity, fairly high
correlation coefficients, representing agreement or convergence, were sought as
evidence of validity. However, the construct validity of a test (as a measure
[...] trait should also [...] discriminant validity—low correlations with
tests from which it is [...] “moral knowledge” tests developed [...] relate
more highly with intelligence [...] A test of moral knowledge has
inadequate trait or construct validity if it correlates higher with tests of
intelligence and reading ability than it does with other tests of moral
knowledge.

[...] multimethod matrix37 as [...] studying the construct validity of a new
test or [...] matrix is given in Table 4.8. Measures [...] each other than
with measures [...]. Through a table of this type, one can study the
convergent validity between independent measures of the same trait and the
discriminant validity between measures of different traits.


35 E. L. Thorndike, “A Constant Error in Psychological Ratings,” Journal of
Applied Psychology, vol. 4 (March 1920), pp. 25-29.

36 Campbell, op. cit., p. 548.

37 A multitrait-multimethod matrix [...] representing at least two traits,
each measured by at least [...] is given in Table 4.8. Note that each
coefficient [...] twice as in the correlation matrix used in factor analysis.

[...] a fairly elaborate study as the basis for his master's thesis, three
methods of obtaining information on each of the traits were used:


METHOD 1. TEACHER RATINGS Each student was rated independently on a
5-point rating scale on “seriousness of purpose” and the other traits by the
three sixth-grade teachers who had worked with him on the teaching-team
program. The average rating for each pupil on each trait was translated into a
stanine score.


METHOD 2. TEACHER NOMINATIONS Each teacher was asked to nominate
20-25 percent of his students whom he would describe as outstanding with
respect to “seriousness of purpose” and each of the other traits. He was also
asked to nominate 20-25 percent whom he would describe as least adequate
in this respect. To obtain the raw score for each pupil the number of negative
mentions was subtracted from the number of positive mentions. Pupils not
mentioned were given a raw score of 0. Again, raw scores were converted into
stanine scores.


METHOD 3. PEER-GROUP NOMINATIONS During the last month of the sixth
grade, each sixth-grade pupil was asked to nominate his classmates for each of
the traits, following the nomination procedure described under method 2.
Although the raw scores were much larger under method 3 (because of the many
peer ratings involved), the use of stanine scores made the scores comparable
in size to those obtained under method 2.
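
The scoring used in methods 2 and 3 can be sketched in Python. The raw-score rule (positive mentions minus negative mentions) is the one described above; the conversion to stanines is an assumption, since the text does not say how it was carried out. The sketch uses the conventional stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, 4) applied to mid-rank percentiles, and the ten raw scores are hypothetical.

```python
from bisect import bisect_right

# Cumulative percentages that close stanines 1 through 8
# (conventional stanine percentages: 4, 7, 12, 17, 20, 17, 12, 7, 4).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]

def nomination_score(positive, negative):
    """Raw score under methods 2 and 3: positive mentions minus negative mentions."""
    return positive - negative

def to_stanines(raw_scores):
    """Convert raw scores to stanines using each score's mid-rank percentile in the group."""
    n = len(raw_scores)
    stanines = []
    for score in raw_scores:
        below = sum(1 for s in raw_scores if s < score)
        tied = sum(1 for s in raw_scores if s == score)
        percentile = 100 * (below + tied / 2) / n
        stanines.append(bisect_right(STANINE_CUTS, percentile) + 1)
    return stanines

# Hypothetical raw scores for ten pupils; a pupil never mentioned scores 0.
raw = [3, 0, -2, 1, 0, 2, -1, 0, 1, -3]
print(to_stanines(raw))
```

Because stanines depend only on a pupil's standing within his own group, the much larger raw scores produced by peer nominations are brought onto the same nine-point scale as the teacher nominations.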


In Table 4.8 are presented the intercorrelations of scores obtained on
three traits by each of three methods. In the column headed A1, for example,
are the correlations between scores on trait A by method 1 (teacher
rating) and scores on all the other trait-method units. A careful study of
the explanation of symbols (at the foot of the table) is necessary to
interpret the table in terms of convergent and discriminant validity.38 (The
term “test” is used throughout even though ratings are involved. The
procedures involved in using this type of matrix to study construct validity
are the same, whether tests or ratings are involved.)


38 A check should first be made to see whether each r is significantly different
from zero. Tables for checking on the statistical significance of r's are given
in standard textbooks on statistics, for example, J. P. Guilford, Fundamental
Statistics in Psychology and Education, 3d ed. (New York: McGraw-Hill Book
Company, Inc., 1956), p. 539. If data for one hundred students or more are
involved, all the r's in Table 4.8 would be significant at the 1 percent level.
That is, there is only one chance in 100 that an r as large or larger than .20
would be obtained if the “true r” were zero.


Table 4.8
An Illustrative Multitrait-Multimethod* Matrix (Table of Intercorrelations)

Columns (in order): Method 1, teacher ratings on traits (A1, B1, C1);
Method 2, teacher-nomination scores on traits (A2, B2, C2); Method 3,
peer-group nomination scores on traits (A3, B3, C3).

METHOD 1: TEACHER RATINGS ON
   A1 Seriousness of purpose    [...]
   B1 Responsibility            [...]
   C1 Originality               [...]

METHOD 2: TEACHER NOMINATION SCORES ON
   A2 Seriousness of purpose    [...]
   B2 Responsibility            [...]
   C2 Originality               [...]

METHOD 3: PEER-GROUP NOMINATION SCORES ON
   A3 Seriousness of purpose    [...]
   B3 Responsibility            [...]
   C3 Originality               [...]

Triangles drawn with solid lines enclose r's involving same method,
different trait. These triangles and the adjacent reliability coefficients
make up same-method blocks.

Triangles drawn with broken lines enclose r's involving different method,
different trait.


The guidelines for interpreting data from such a multitrait-multimethod
matrix are as follows:


CONVERGENT VALIDITY 1. The validity coefficients (boldface figures) should
be sufficiently large to justify further research on the construct. This
condition seems to be satisfied for all three traits.


DISCRIMINANT VALIDITY 2. Each validity coefficient (boldface) should be
higher than the values in its column and row in the triangles drawn with broken
lines; that is, a validity coefficient should be higher than r's obtained
between that trait and any other variable having neither trait nor method in
common. A check of Table 4.8 reveals that the validity coefficients for trait A
do not meet this standard, but that those for traits B and C almost meet it.

3. Each “test” (trait-method unit) should correlate higher with other tests
measuring the same trait than with tests involving the same method and
designed to measure different traits. Examination of Table 4.8 reveals that
this standard is not even approximated. The r's in the same-method triangles
have a higher average than do the validity coefficients. Especially high, for
example, is the r of .70 between teacher ratings on “seriousness of purpose”
and “responsibility” and also the r of .75 between peer ratings on these two
traits. Method variance equals or exceeds trait variance in the instruments
studied. In other words, such method factors as the “halo effect” in ratings
account for much of the convergent validity reflected in the validity
coefficients.
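
Guidelines 2 and 3 can be applied mechanically once the matrix is in hand. The sketch below, in Python, uses a hypothetical matrix and not the values of Table 4.8: every validity coefficient is set at .50, every same-method, different-trait r at .60, and every different-method, different-trait r at .30. The names `hypothetical_r` and `evaluate` are illustrative assumptions.

```python
from itertools import combinations

TRAITS, METHODS = "ABC", "123"
UNITS = [t + m for m in METHODS for t in TRAITS]   # trait-method units: A1, B1, ..., C3

def hypothetical_r(x, y):
    """Hypothetical correlations for illustration only (not the values of Table 4.8)."""
    if x[0] == y[0]:
        return 0.50   # validity coefficient: same trait, different method
    if x[1] == y[1]:
        return 0.60   # same method, different trait (inflated by halo)
    return 0.30       # different trait, different method

R = {frozenset(pair): hypothetical_r(*pair) for pair in combinations(UNITS, 2)}

def evaluate(R):
    """For each validity coefficient, report (meets guideline 2, meets guideline 3)."""
    report = {}
    for pair, validity in R.items():
        x, y = sorted(pair)
        if x[0] != y[0]:
            continue  # keep only same-trait, different-method pairs
        # Guideline 2: r's sharing neither trait nor method, in the same row and column.
        hetero = [R[frozenset((u, v))] for u, w in ((x, y), (y, x))
                  for v in UNITS if v[1] == w[1] and v[0] != u[0]]
        # Guideline 3: r's of each unit with other traits measured by the same method.
        mono = [R[frozenset((u, v))] for u in (x, y)
                for v in UNITS if v[1] == u[1] and v[0] != u[0]]
        report[(x, y)] = (validity > max(hetero), validity > max(mono))
    return report

report = evaluate(R)
for units, (meets_2, meets_3) in sorted(report.items()):
    print(units, "guideline 2:", meets_2, "guideline 3:", meets_3)
```

In this hypothetical matrix every validity coefficient meets guideline 2 but fails guideline 3, the pattern one would expect when a method factor such as the halo effect is strong.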


Summary of Procedures Involved in Constructing and Validating Tests
for the Fourth Purpose


A study of Table 4.9 helps the reader to realize that many of the tests
used in guidance and research are measures of traits or constructs. Hence,
no single criterion can be used in test development and test validation.
Many interrelated research studies are needed, and test scores become
more meaningful as more and more of the necessary data are cumulated
and interpreted.


Table 4.9
Construction and Validation of Tests Used to Serve the Fourth Purpose—Making
Inferences Regarding Individual Status on a Postulated Trait or Construct

AIM IN TESTING (ON JUDGMENT TO BE MADE)

To make inferences concerning the degree to which a person possesses a
hypothetical trait or construct, presumably reflected in the test performance.

EXAMPLES: To make inferences from test scores concerning a person's
anxiety-proneness, his level of motivation to [...], or his suggestibility;
to describe a person in terms of his pattern of aptitude, personality trait,
or interest.

TYPES OF TESTS USED FOR THIS PURPOSE

Tests of general intelligence or creative thinking ability, personality
inventories, or interest inventories (when used to describe persons rather
than to predict specific criterion scores).

GENERALIZED PROCEDURES FOR THE DEVELOPMENT OF TESTS FOR THIS PURPOSE

1. Devise a large number of items presumed to measure the trait, on the
   basis of
   a. Hypotheses about the nature of the trait.
   b. Empirical data on differential responses to items by groups that should
      differ with respect to the trait (as age groups on intelligence test
      items).
2. Through care in item writing and revision, minimize factors that would
   contaminate the assessment of the trait. For example:
   a. Make sure that the items do not require above-average reading ability
      or vocabulary, or knowledge based on specialized experience.ᵃ
   b. Make such revisions in directions, content, and form of the test as
      will reduce invalid variance due to
      (1) Lack of clarity in communicating to the examinee.ᵇ
      (2) Response tendencies.ᶜ
   c. Unless speed in performance is a basic aspect of the trait, time limits
      should be established that will allow all examinees to complete items.
3. Administer the revised test to a large group of examineesᵈ who constitute
   a heterogeneous group with respect to the trait being measured.
4. By means of a factor analysis of items (considering each test item as a
   variable), select those items for each trait that will maximize the
   homogeneity of the trait score.ᵉ If a factor analysis of items is not
   feasible, use other methods for improving the homogeneity of each test.ᶠ
5. Administer the revised test to other groups of examinees (representative
   of the groups for which the test was designed). Also obtain for the same
   examinees data on
   a. Tests for which one would predict fairly high correlations. Within this
      group there should be some tests that represent [...] the postulated
      trait, and others that measure variables which should be, according to
      psychological theory, related to the postulated trait.
   b. Tests for which one would predict fairly low correlations. This group
      should include tests that measure probable sources of invalid variance,
      such as individual differences with respect to response tendencies.
6. If, according to findings from the study outlined above, two or more
   methods of measuring a trait do not show [...] agreement, [...] other
   measure(s) is adequate for measuring the trait [...]
7. Make a series of studies in which the investigator formulates and tests
   hypotheses concerning the behavior of persons who score relatively high,
   and relatively low, on the postulated trait.
   a. Experimental situations should be specially designed to see if the
      behavior of high-scoring examinees differs from that of low-scoring
      examinees in ways that verify hypotheses based on psychological theory.
   b. If a hypothesis is confirmed, such evidence would support both the
      validity of the test and the validity of the theory. If a hypothesis is
      not confirmed, we need further evidence to help us in judging whether
      we should question the validity of (1) the postulated trait, (2) the
      psychological theory on which we based our hypotheses concerning
      relationships between the trait and relevant behavior, or both.

INFORMATION THAT SHOULD BE FURNISHED IN A TEST MANUAL REGARDING THE
CONSTRUCT VALIDITY OF A TESTⁱ

1. Evidence concerning the homogeneity or internal consistency of each test
   should be provided.
2. The correlations of the test with any published, generally accepted
   measures of the same attribute should be reported.
3. Information concerning differences in test performance for groups
   differing in age, sex, and other characteristics that might affect the
   attribute should be included.
4. If the test is given with a time limit, evidence concerning the effect of
   speed on test scores should be presented.
5. The manual should report all available data concerning tests of hypotheses
   about relationships between scores on the trait and external variables
   that (according to psychological theory) should have low, moderate, or
   high relationships with the trait.
6. Correlations of trait scores with scores on one or more intelligence tests
   should be reported. It should be demonstrated that the test correlates
   higher with each of several trait-appropriate criteria than does the
   intelligence test.
7. For a personality inventory (or other measure of the voluntary
   self-description type), procedures used to minimize the effects of social
   desirabilityᵍ on trait score should be reported. It should be demonstrated
   that the new test correlates higher with each of several trait-appropriate
   criteria than does the general “social desirability” factor.
8. For a personality inventory (or other inventory of the voluntary
   self-description type), procedures used in test construction and scoring
   to minimize the effect of acquiescence (tendency to agree) and/or other
   response sets should be reported. It should be demonstrated that the new
   test correlates higher with each of several trait-appropriate criteria
   than do response set scores.
9. A multitrait-multimethod matrix (similar to Table 4.8) should be presented
   so that the user can examineʰ
   a. Evidence of convergent validity (correlations between independent
      measures of the same trait).
   b. Evidence of discriminant validity (relationships involving different
      traits measured by the same method, and different traits measured by
      different methods).

ᵃ For example, in developing an interest inventory, make sure that the
activities listed do not involve technical ones that some students would not
understand. Also make sure that the activities listed are ones that most
students would be acquainted with, so that differential experience in the
various interest fields does not unduly affect the [...] scores in these
areas.

ᵇ For example, directions concerning omission of questions, interpretation of
response categories such as “frequently” or “occasionally,” and the item [...]

ᶜ Separate response set scores might be devised, such as a score on
“tendency to agree” or “tendency to choose responses that will create a good
impression.” If feasible, several methods (such as forced-choice questions)
can be used to reduce the effects of response tendencies on test scores.

ᵈ In order for the results to be sufficiently stable, at least 300 examinees
should be tested. Jum C. Nunnally, Jr., Tests and Measurements: Assessment and
Prediction (New York: McGraw-Hill Book Company, Inc., 1959), p. [...]

ᵉ Special methods have been developed for approaching a problem involving
such a large number of intercorrelations, for example: R. J. Wherry and B. J.
Winer, “A Method for Factoring Large Numbers of Items,” Psychometrika, vol. 18
(June 1953), pp. 161-179.

ᶠ F. B. Davis, “Item Analysis in Relation to Educational and Psychological
Testing,” Psychological Bulletin, vol. 49 (March 1952), pp. 97-121.

ᵍ For research concerning “social desirability” (the tendency to respond so as
to create a good impression), see A. L. Edwards, The Social Desirability
Variable in Personality Assessment and Research (New York: Holt, Rinehart and
Winston, Inc., 1957).

ʰ The terms “convergent” and “discriminant” validity are explained on page 138.

ⁱ For additional illustrations for requirements 5 through 9 the student is
referred to Donald T. Campbell, “Recommendations for APA Test Standards
Regarding Construct, Trait, or Discriminant Validity,” The American
Psychologist, vol. 15 (August 1960), pp. 546-553.


SUMMARY STATEMENT


As the student has already discovered, validity is a fundamental and crucially
important concern in any type of measurement. A test does not have validity
in general, but only in terms of its use for specific purposes and with
specific groups. Hence, the subject of validity is the most complex one in the
entire field of measurement.

The term “validity” refers to the value of a test as a basis for making
judgments about examinees. For a test to be valid, it must measure “something”
[...] (with reliability), and that “something” must be either a representative
sample of the behavior we wish to judge, or it must have demonstrated
relevance to that behavior.

If we wish to know how individuals perform at present with respect to 
certain skills and knowledges, we try to devise a test which samples those skills 
and knowledges. If the behavior to be measured can be exactly defined (such 
as the correct spelling of the seventh-grade list of spelling words), content 
validity can be attained by including in the test a random sampling of the
complete list of spelling words, multiplication facts, or some other defined uni- 
verse. Whenever the skills or knowledges to be sampled cannot be accurately 
defined and randomly sampled, human judgment must enter into the selection 
of learnings to be tested. As a guide for test construction, a table of specifica- 
tions of the test content should be developed and followed. Our concern, 
however, is still with the representativeness of the test sample in terms of the 
universe about which we wish to make inferences. 
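
When the universe can be listed completely, the random sampling described above is easily sketched in Python. The universe chosen here (the 81 multiplication facts from 1 x 1 through 9 x 9), the test length of 20 items, and the names `draw_test_items` and `seed` are assumptions made for illustration.

```python
import random

def draw_test_items(universe, n_items, seed=None):
    """Draw a simple random sample, without replacement, from a fully defined universe."""
    rng = random.Random(seed)
    return rng.sample(universe, n_items)

# Hypothetical universe: the 81 multiplication facts from 1 x 1 through 9 x 9.
universe = [(a, b) for a in range(1, 10) for b in range(1, 10)]
items = draw_test_items(universe, 20, seed=1964)

for a, b in items[:3]:
    print(f"{a} x {b} = ____")
```

Every fact in the universe has the same chance of appearing in the test, so a pupil's score on the sample can be taken as an estimate of his performance on the whole universe.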

If we use multiple-choice questions on spelling and arithmetic, or some other
indirect procedure, for measuring the criterion behavior about which we wish
to make judgments, we are obligated to assess the concurrent validity of our
test, that is, the relationship of individuals’ test scores to their results on
some measure of criterion behavior external to our test. For the examples given
above, the external criteria would be students' scores on dictation tests of
spelling and on arithmetic tests involving actual computation. Or the concurrent
validity of a group intelligence test might be studied by determining the
relationship between students’ scores on it and their scores on an individual
intelligence test.
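
The relationship in question is ordinarily expressed as a correlation coefficient. The following Python sketch computes the product-moment r between scores on a multiple-choice spelling test and scores on a dictation test given to the same pupils; the eight pairs of scores are hypothetical, and the function name `pearson_r` is an illustrative assumption.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for eight pupils on the indirect test and on the criterion.
multiple_choice = [12, 15, 9, 18, 14, 11, 16, 8]
dictation = [25, 31, 20, 35, 27, 24, 30, 19]

r = pearson_r(multiple_choice, dictation)
print(f"Concurrent validity coefficient: r = {r:.2f}")
```

A coefficient based on so few cases would be far too unstable for actual use; the sketch shows only how the two sets of scores are related.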


[...] to aid counselors [...] the future attainment of certain criterion
scores, on the basis of the students' obtained test scores.

[...] independent variable, [...] such as increasing the percentage of
trainees who succeed in a training [...] research studies or [...]

[...] validity of a test which presumes to measure a relatively pure trait,
several criteria must be used. For example, many hypotheses can be made
concerning the differences in test and [...] between groups of high-scoring
and low-scoring students on a specific intelligence test. If correlations were
obtained between scores on that test and several other variables,
psychological theory would predict high correlations with some variables and
low or negative correlations with other variables. Confirmation of hypotheses
based on theory would provide evidence for the construct validity of the test.

Factor analysis is used to study the extent to which tests measure unitary, 
independent dimensions, and also to help in interpreting the major sources of 
variation in test scores. The factor loading of a test on the chief factor it was 
designed to measure is called its factorial validity. 

In order to assist the student in gaining a thorough understanding of this
complex subject, the authors have prepared four summary tables, one for each
of the aims listed in the “Technical Recommendations.” In each table, we
have given a generalized outline for the development of tests of the specified
type. These procedures are not intended as a short course in the construction
of standardized tests, but rather as an aid to the student in understanding the
four types of validity and in comprehending what procedures should be used
by test authors and publishers. They should also deter the student from amateur
test construction, except in the development of tests for the first purpose,
that is, measuring student achievement in knowledges and skills in the subjects
he teaches. For example, if a student who is undertaking a research study for
his master's thesis decides to develop a test or inventory to measure some
trait, an examination of Table 4.9 will help him to realize how far his
procedures differ from those recommended, and how cautious he should be in
making inferences from his test data.

In order not to obscure the major points in each of these tables, much of the 
illustrative material has been placed in the footnotes. The student is urged to 
read carefully all footnotes giving illustrative material, for such study will 
greatly increase his understanding of the generalizations listed in the tables. 


SELECTED REFERENCES


CAMPBELL, DONALD T., “Recommendations for APA Test Standards Regarding
    Construct, Trait, or Discriminant Validity,” American Psychologist, vol.
    15 (August 1960), pp. 546-553.
CRONBACH, LEE J., “Validity,” in C. W. Harris, ed., Encyclopedia of Educational
    Research. New York: The Macmillan Company, 1960.
———, AND PAUL E. MEEHL, “Construct Validity in Psychological Tests,”
    Psychological Bulletin, vol. 52 (June 1955), pp. 281-302.
EBEL, ROBERT L., “Obtaining and Reporting Evidence on Content Validity,”
    Educational and Psychological Measurement, vol. 16 (Autumn 1956), pp.
    269-282.
FLANAGAN, JOHN C., “The Critical Incident Technique,” Psychological Bulletin,
    vol. 51 (July 1954), pp. 327-357.
HUDDLESTON, EDITH M., “Test Development on the Basis of Content Validity,”
    Educational and Psychological Measurement, vol. 16 (Autumn 1956), pp.
    283-293.
KACZKOWSKI, HENRY R., “Using Expectancy Tables to Validate Test Procedures
    in High School,” Educational and Psychological Measurement, vol. 19 [...]
LENNON, ROGER T., “Assumptions Underlying the Use of Content Validity,”
    Educational and Psychological Measurement, vol. 16 (Autumn 1956),
    pp. 294-304.
[...], “How Substantial is a Substantial Validity Coefficient?”
    Personnel and Guidance Journal, vol. 34 (February 1956), pp. 340-344.
SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness, rev. ed.
    New York: Harper & Row, Publishers, Inc., 1962, Chapter 3.
WESMAN, ALEXANDER G., “Expectancy Tables—A Way of Interpreting Test
    Validity,” Test Service Bulletin No. 38. New York: The Psychological
    Corporation, 1949. Available on request.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES


1. Cite specific examples of ways in which test results can be used for each
of the four purposes outlined by the American Psychological Association (page
106).

2. Examine a standardized test in your major subject field and evaluate its
content validity.

3. Indicate whether concurrent or predictive validity would be studied in
each of the following situations:

a. A school wishes to use a short form of an accepted reading readiness
test.
b. A school wishes to select students for an accelerated class in algebra.
c. A college wishes to select freshmen who are likely to complete four
years of college.

What would the criterion be in each of the three validation studies?

4. When will a teacher need to consider concurrent validity in selecting
achievement tests? How is the [...]

5. Administer a short multip[...] the students take a traditional
test involving these same problems. Show, by a [...] these [...]
advantages and limitations of each approach? What type of validity is
being studied?

[...] designed for the same grades. Describe and [...] and abilities
measured, [...]


5
Application of the Principles
of Measurement in the
Selection of Tests


In the preceding chapters we have studied three basic areas of concern in 
measurement. The processes of norming tests, estimating their reliability, 
and studying their validity have all been presented and illustrated. From a
study of these chapters, the student has discovered (1) that there is no 
best type of converted score, but that each type has advantages and limi- 
tations for specific purposes; (2) that there is no arbitrary standard for the 
size of a reliability coefficient and no method of computing reliability that
is ideal for all purposes; and (3) that there can be no single basis for 
ranking tests in order of their over-all validity but that the validity of a test 
must be determined for each purpose for which test results will be used. 
In this chapter, we will focus our attention on the selection of tests for 
specific purposes; we will present a test evaluation form that is designed to 
focus the user's attention on his purposes in testing; and we will indicate 
the sources of information available to help us in the location of suitable 
tests and in the process of their evaluation. Before we discuss test selec-
tion, however, we will consider (in the first chapter section) the different
types of tests available for use.


TYPES OF TESTS AVAILABLE 


All tests are designed to obtain samples of examinee behavior. Instead of
waiting to observe behavior as it naturally occurs in the course of daily
life, the test is designed to elicit from examinees the behavior the test user
wishes to evaluate.


Standardized Tests

One of the obvious distinctions made among tests is that between
teacher-made tests and standardized tests. The essential characteristic of a
standardized test is that the test materials, the procedures for administration,
and the procedures for scoring have been so carefully developed that
the test can be given and scored in the same manner at different times and
places.


[...] groups) is often called the standardization of the test; it should be evident,
however, that the standardizing of test content, procedures, and scoring is
actually a prerequisite to norming.

The term “standardized test” has come to signify a measuring instrument
with the following characteristics:

[...] research is conducted to study its reliability and validity.

5. A manual is supplied that explains the purposes and uses of the test, describes
[...] administering, scoring, and interpreting results, contains tables of norms,
and summarizes available research data on the test.


Classification of Tests According to Degree of Indirectness of Measurement 


[...]ations. It is important [...] observation in natural [...] as possible regarding
the meaning of individual differences in test scores.

There are many obstacles to direct measurement of behavior in natural
situations. Direct measurement is based on a representative sampling of
criterion behaviors in the real-life situations in which we are really
interested. Indirect measurement involves the measurement (usually by methods
that are more reliable and efficient) of behavior that is presumably related 
to the ultimate criterion behavior (for example, to desired changes in stu- 
dent behavior or successful performance in a chosen career). Almost all 
of our tests, ratings, and other evaluation techniques involve measurement 
that is indirect in some degree. 

The widespread use of indirect measures may be largely attributed to 
several obstacles to direct measurement. 

Direct measurement may be impossible because of the delayed appear- 
ance of the desired criterion behavior. For example, one of our major aims 
in teaching nutrition is to have future homemakers plan nutritious meals 
when they have families of their own; one of our major aims in civics 
instruction is that our students (when they become adults) should study 
thoroughly all issues on which they will vote. Obviously such future cri- 
terion behavior cannot be directly sampled, but only indirectly approached 
through the study of what students can do, or will do, in current situations.¹

Current criterion behavior may be inaccessible, or not readily observed. 
For example, one of our ultimate objectives in driver-training courses is to 
have students drive safely when they are not under adult supervision; one 
of our objectives in teaching science is to interest our students to the 
extent that they voluntarily read articles in newspapers and magazines 
about advancements in science. It is not convenient or practical for the 
teacher to make firsthand observations in these areas. Hence, he may sub-
stitute a questionnaire filled out in class, which he recognizes as a very
poor substitute for direct observation.

Another problem is the infrequency of current occasions for observing
the desired criterion behavior. For example, the instructor in first aid and
lifesaving will probably witness no real-life situations in which the skills
his students have learned must be demonstrated. One of the chief advan-
tages of a test is the efficiency with which samples of behavior can be
obtained.

A serious obstacle to direct measurement of criterion behavior is the
lack of comparability of real-life situations from person to person. A visit-
ing coach who tries to assess the players on a team [...] game gives him
an inadequate basis [...] opportunities to show their skills. He may be able
to identify those who rank at either extreme; the ranking of other players
would not be justified. If a coach controls the situation so as to make
conditions more comparable among players, he has a better basis for comparing
the players' ability to bat balls or catch fly balls. When he does so, however,
he is taking some of the steps that are necessary in developing a test.

¹ Even if we did a follow-up study in an attempt to sample such criterion be-
haviors for students who had been in the nutrition or civics courses, as compared
with those who had not, it would be almost impossible to infer that differences
found were due to our instruction because of the many other factors that would
influence delayed criterion behavior.

Perhaps the chief reason for the popularity of indirect measurement is 
that direct measurement of many types of criterion behavior is very costly 
in time and effort, or so inefficient as to be impracticable. For example, 
if we want to make inferences about students’ spelling ability in their regu- 
lar written work, we would have to examine thousands of words of writing 
for each pupil. Moreover, students try to avoid using words that they do 
not know how to spell. Using a test list of spelling words, pretested to 
eliminate those that do not differentiate between good and poor spellers 
at this grade level, would constitute a far more efficient approach. And, 
if we can show that the scores on a recognition spelling test correlate well
with dictation-spelling scores, this still more indirect approach may be
justifiable.
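
The check suggested above, whether recognition-spelling scores "correlate well" with dictation-spelling scores, might be sketched as follows. This is only an illustration: the product-moment coefficient is one common way of expressing such a relationship, and the function name and the eight pairs of scores are hypothetical, not data from any actual study.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same eight pupils (illustrative data only).
recognition = [12, 15, 9, 20, 17, 11, 14, 18]  # recognition spelling test
dictation = [10, 14, 8, 19, 15, 9, 13, 18]     # dictation spelling test

r = pearson_r(recognition, dictation)
print(f"r = {r:.2f}")
```

A coefficient near 1.00 in such a comparison would support using the more indirect recognition test; how high is "high enough" still depends on the purpose for which the scores will be used.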


Another obstacle is the complexity of most criterion behavior. The com-
plexity of the criterion behavior implied in “success as a teacher” is a good
illustration. We could decide to observe the actual process of teaching;
here the problems of obtaining a representative sampling of teaching situ-
ations, defining criteria for judgment, and assigning scores in terms of what
we observe, appear formidable. If we decide that we are interested in the
product of teaching, rather than the actual performance, we again face
formidable problems in assessing all the significant aspects of student
growth, some of which are much more easily measured than others.²

The various tests we use vary in the extent to which they approach
direct measurement of criterion behavior.

The most direct is the “work-sample” or “identical-elements” test. The
examinee is given special occasion to do some of the tasks on which we
want to appraise his competency. In this group we could classify most
[...]ation, reading comprehension, typewriting, short-[...] like. Many of
the performance tests discussed in [...] work-sample or “identical-elements”
type.

² The preceding discussion regarding indirectness of measurement is largely
based on a list of obstacles to direct measurement presented in E. F. Lindquist,
“Preliminary Considerations in Objective Test Construction,” Educational Measure-
ment (Washington, D.C.: American Council on Education, 1951), pp. 143-146.

[...] demonstrate athletic skills under controlled, rather than team-play, conditions.
In other tests of related behavior, test behavior is not very similar to 
criterion behavior. Many tasks included in reading readiness tests are not 
obviously related to reading achievement. Nor would one suspect that 
scores on the Minnesota Clerical Test would show a substantial relation-
ship to production records for packers, wrappers, and inspector-packers.³

A third basic test type is the “verbalized-behavior” type of test, in which 
behavior situations are described to the examinee and he tells how he 
would behave in those situations. The examinee may be required to indi- 
cate in his own words how he would plan an experiment, budget his money, 
or plan a nutritious meal. A nursing student, for example, might select the 
procedures he would use if a patient presented a specific pattern of 
symptoms.

A fourth type measures only the knowledge of facts and principles 
needed by the student in order to show the criterion behavior. We can find 
out through a test how much the student knows about a car and about 
traffic rules; how much he knows about the nutritional elements in differ- 
ent foods or about methods of first aid. The knowledge outcomes of educa- 
tion are important. Knowledge is necessary, but not sufficient, to the 
achievement of the ultimate objectives of education. 

This classification of tests according to degree of indirectness in meas-
urement is one of the most basic. Although there are no sharp lines of
demarcation, the four types represent various degrees along a continuum
of relevancy to the criterion behaviors on which we wish to compare
individuals.


Other Classifications of Tests on the Basis of Procedures 
There are many other possible classifications of tests. Many of these are 
concerned with such procedural differences as: 


1. Group tests (which can be administered to groups or to individuals) vs
individual tests (which must be administered individually).⁴

2. Pencil-and-paper tests vs performance tests (the [...]
applied to tests requiring the use of physical objects and the application of
physical and motor skills).

3. Speed tests vs power tests. In a speed test, the tasks presented are of approx-
imately the same difficulty; administration time is limited so that none, or
almost none, of the examinees can finish; the score reflects the examinee's
speed of reading, typing, proofreading or performing some other [...].
In a power test all, or almost all, examinees are given sufficient time to com-
plete the test; the tasks are arranged in order of difficulty; the examinee's
score reflects his accuracy and the level of difficulty at which he can success-
fully perform.

³ M. L. Blum and B. Candee, “The Selection of Department Store Packers and
Wrappers with the Aid of Certain Psychological Tests,” Journal of Applied Psy-
chology, vol. 25 (June 1941), pp. 291-299; E. E. Ghiselli, “Tests for the Selection
of Inspector-Packers,” Journal of Applied Psychology, vol. 26 (August 1942), pp.
468-476.

⁴ A comparison of group and individual tests will be made in Chapter 6 on
aptitude testing.


These differences in testing procedure are illustrated in Chapter 6 on apti- 
tude testing. 


Classification of Tests on the Basis of Content 


The only other classification of tests that will be considered in this chap- 
ter is based on test content. The major distinction here is between ability 
tests and tests or inventories of personality, interests, and attitudes. 


TESTS OF ABILITIES In tests of abilities,

1. The goal is to measure the individual's maximum performance.


2. The examinee perceives the situation as one in which he should strive for 
accuracy and provide evidence of competency. 

3. Comparison of results for different individuals is based on the assumption 
that all examinees are equally well motivated. 

4. There are external standards of correctness on which experts agree. These 
external standards provide the basis for a scoring key, by which all answers 
can be evaluated. 

5. The test author(s) attempt to reduce ambiguity of test content so that all
persons will be working on essentially the same tasks.


Tests of ability can be grouped into two major categories:

1. aptitude tests, which are used to predict a person's future performance in
some educational program or in some vocation (to be discussed in Chapter
6).

2. achievement tests, which measure a person's present knowledge or level of
performance in order to appraise individual or group success in past learning
activities (to be discussed in Chapters 11-13).


Achievement tests are more heavily weighted with tasks that measure
the students' learnings in specific courses, while aptitude tests include novel
tasks and/or tasks with which all students are likely to have had previous
experience (through learnings outside school and through a common core
of required subjects).


Some tests are not easily classifiable as aptitude or achievement tests.
For example, achievement tests that measure students' progress toward the
over-all goals of the educational program, and scholastic aptitude tests
that are designed to measure cognitive abilities developed through school
experience, occupy an intermediate position. In Figure 5.1 we have at-
tempted to portray, through the use of a number of examples, the way in
which different standardized tests represent varying amounts of emphasis
on the verbal-educational factor.⁵

Fig. 5.1 Selected Ability Tests Representing Relatively Greater or Less Emphasis
on the Verbal-Educational Factor. (The development of this figure was suggested
by similar charts from Anne Anastasi, Psychological Testing, 2d ed. [New York:
The Macmillan Company, 1961], and Lee J. Cronbach, Essentials of Psychological
Testing [New York: Harper & Row, Publishers, Inc., 1960]. For complete titles of
tests and information concerning them, the reader is referred to the classified
list of tests in the Appendix.)


INVENTORIES OF PERSONALITY, INTERESTS, OR ATTITUDES Many tests 
in this area are really questionnaires involving self-report or self-descrip- 
tion. In fact, the term "inventory" rather than "test" is being increasingly 
used. 


In inventories of personality, interests, or attitudes (or in the observation
and rating of these traits in real-life situations),

1. The goal is to sample the individual's typical performance, and he is encour-
aged to react in any way that is natural or typical for him.
2. Examinees can usually change their responses at will. Whereas an examinee
cannot fake answers to an ability test, he can usually fake his replies to
inventories of attitudes, interests, or personality traits. Every effort is made
to have the individual perceive the testing situation as one in which he can
feel free to show, or report, his typical behavior, feelings, and attitudes.
In a counseling situation, the inventory is presented as an aid to counselor
understanding and to self-understanding. In an employment situation, the
technique is presented as an aid in maximizing the examinee's job satisfac-
tion through placing him in a job best suited to his temperament or interests.
Sometimes the purpose of the inventory is disguised, as when interest inven-
tories are scored by personality-trait keys.
3. Interpretation of results for individuals is based on the assumption that all
respondents are equally willing to report typical behavior; or provision is
made to correct individual scores for tendencies toward deception and test-
taking defensiveness.
4. There are usually no external standards of correctness, the responses being
summarized in terms of categories.⁷

⁵ The term used by British factor analysts to represent competency in school-
developed learnings.

⁶ It is recognized that test users vary in the degree to which they interpret re-
sponses to personality inventories as (1) having face value as honest and insightful
responses, or (2) symptoms that have been shown empirically to be related to
job success, job satisfaction, probability of improvement under therapy, or some
other criterion. Even where responses to inventories are scored merely as symp-
toms, however, the assumption is made that examinees are responding to the
questionnaire in a manner that is typical of them. In other words, the person who
feels impelled to give many of the “good-impression” type of responses in an
employment situation is assumed to show similar behavior in other situations in
which he would feel vulnerable or on the defensive.

⁷ When an inventory is to be scored in terms of an individual's place on a
desirable-undesirable continuum, such as neuroticism or predicted success in a
specific job, external standards of “correctness” do exist. In such situations it is
recognized that many examinees will strive for a maximum score; hence test items
that are not easily faked are used and/or corrections are made for each individual's
tendency to fake.



EVALUATING TESTS FOR USE FOR SPECIFIC PURPOSES 


The principles of measurement (discussed in Chapters 2-4) become more 
meaningful when students have experience in applying them to the evalu- 
ation of specific tests for proposed uses. Ordinarily such experiences are 
most valuable if the student evaluates tests within subject areas in which he 
is teaching or plans to teach.⁸

It is when the student evaluates tests in his own subject field that he 
can most effectively judge the content validity of tests. Moreover, am- 
biguities in the wording of items, poor selection of distractors (false 
alternatives), and other weaknesses in test items are most easily detected 
in one's own subject field. The prospective counselor should evaluate tests 
that will aid him in helping students to make wise decisions in their life
planning.

⁸ Many universities and colleges maintain files of standardized tests for use by
students in measurement classes. If such a file is not available, or if the file does
not include tests in a student's particular field of interest, the student in a measure-
ment class is given the privilege of ordering directly from test publishers specimen
sets of tests that he wishes to evaluate. Such requests, of course, must be counter-
signed by the professor, who will check the request for its appropriateness.


Table 5.1
Summary of Data Needed in the Appraisal of a Standardized Testᵃ

REFERENCE DATA
1. Title.
2. Names of major subtests.
3. Author(s).
4. Publisher.
5. Range in grades.
6. No. of forms.
7. Purposes for which test is recommended by author.
8. Intended use (purpose and group for which test is being evaluated).

CONTENT VALIDITY (especially important for [...])
9. Abilities and skills that this test is designed to sample.
10. Bases for selecting items (sources of items and criteria for inclusion).
11. Your comments regarding the appropriateness of the item content for your local
curriculum or for the specific purposes for which you will use test.

CONCURRENT AND PREDICTIVE VALIDITY (especially important for aptitude tests and
for other tests used to assist in selection or placement)
12. Summarize results of statistical studies relevant to your intended use of the test.
NOTE: Report criterion, data regarding number of cases and significant charac-
teristics of validation group, and results of validity studies.ᵇ



OTHER EMPIRICAL EVIDENCE OF VALIDITY
13. Summarize other validation studies that help in the interpretation of test data
for intended use.
14. Your comments regarding statistical evidence of validity of test for intended use.

RELIABILITY (for total scores and for any subscores that will be interpreted)
15. Evidence of equivalence or internal consistency (that is, consistency of perform-
ance on specific content samples).ᶜ
16. Evidence of stability (consistency over time).ᵈ
17. Comments regarding adequacy of reliability for intended purpose.

NORMS
18. Types of converted scores.
19. Availability of multiple norms for homogeneous subgroups (for example, by
sex, age, occupation, curriculum, and the like).
20. Adequacy of norming sample(s).
21. Recency of norms (date of latest revision).
22. Your comments regarding adequacy of norms for intended use.

PRACTICAL CONSIDERATIONS WITH RESPECT TO ADMINISTRATION AND USE
23. Complexity of administrative process for examiner and students.
24. Time requirements.
    Working time for students
    Total administration time
    Is more than one testing session required?
25. Scoring.
    Have adequate procedures been used to minimize scoring time?
    If the scoring is not entirely objective, does manual provide adequate directions
    for scoring?
    Are any special qualifications required for scoring the test?
26. Aids to interpretation.
    Can raw scores be easily translated into converted scores appropriate to your
    purpose?
    Are special forms (such as profile sheets) provided to aid in the interpretation
    of results? Are these forms so designed as to help the user consider errors of
    measurement?
    Does the manual provide sound and helpful aids to interpretation and valid
    suggestions for use?
27. Cost of testing.
    NOTE: Consider not only cost of booklets but whether such booklets can be
    re-used with consumable answer sheets. Consider also clerical time in-
    volved in scoring.

COMMENTS OF REVIEWERS (See test list in Appendix for references to reviews in
Buros Yearbooks)

YOUR OVER-ALL EVALUATION OF THE TEST FOR INTENDED USE

ᵃ In order to conserve space in printing, no space has been allowed for filling in data or
comments. The student would merely incorporate the headings in his typed report.

ᵇ In interpreting validation data, take the following factors into account: (1) the criterion
variables used (evidence concerning their reliability and probable relationship to ultimate cri-
terion); (2) time elapsed between administration of predictor test and obtaining of criterion
scores; (3) evidence of possible criterion contamination (for example, test data being available
to persons making criterion ratings); (4) characteristics of validation group (number of cases,
M and SD of test and criterion scores); (5) whether the test was cross-validated, that is, validity
coefficients were computed on a different group than the one used in the selection of test items;
(6) whether data are given that would enable the user to judge the confidence with which he
can estimate criterion data.

ᶜ In interpreting data, note the number of cases, the method used, and the results. Also
note whether the reliability coefficient was computed on groups that are about as homogeneous
as the groups on which a test is typically used (for example, a median reliability coefficient for
several school-grade groups is desirable, rather than a spuriously high reliability coefficient,
computed on all such groups combined).

ᵈ In interpreting data, note procedures, characteristics of group used in stability study, and
the results. In addition, note the time interval between the two administrations of the test.


In Table 5.1 is presented a form for summarizing data about specific
tests. In this outline, we have reminded the student of specific points
studied in Chapters 2-4 that are relevant to test selection. In each section 
of the form, the student is asked to keep in mind his intended use (or uses) 
of the test data. 

The first major topics in the outline are concerned with validity. Validity 
for the intended use is the sine qua non of any test. An achievement test 
that does not test learnings relevant to the goals of instruction may give 
misleading results; an aptitude test that does not have predictive validity 
may lead to the wrong decisions. 

Validity actually includes reliability as well as relevance. A test must
have a fair degree of reliability, that is, must measure some attribute with
fair consistency, in order to provide a dependable basis for any type of
judgment. Reliability is a necessary, but not sufficient, condition for valid-
ity. Following validity and reliability, the outline considers norms and
practical considerations with respect to administration and use. Each major
section of the outline will be discussed in turn.
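
The statement that reliability is necessary but not sufficient for validity can be made concrete with a standard result of classical test theory. The following is offered only as a sketch; the symbols are our own notation (not the text's): $r_{xy}$ is the correlation between test $x$ and criterion $y$, and $r_{xx'}$ and $r_{yy'}$ are their reliability coefficients.

```latex
% Upper limit on a validity coefficient under classical test theory
r_{xy} \;\le\; \sqrt{r_{xx'}\, r_{yy'}} \;\le\; \sqrt{r_{xx'}}
% Illustration: if r_{xx'} = .64, then r_{xy} \le \sqrt{.64} = .80
```

Thus a test with a reliability of .64 cannot correlate higher than .80 with any criterion, however relevant its content; but a high reliability by itself guarantees nothing about validity, since the inequality sets only a ceiling.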




Content Validity 


In studying the content validity of a test for one's own purposes, it is 
essential to study the test and manual carefully. Check the manual to see 
if it provides a classification of test items to help the user judge content 
validity for his own purposes. Determine how closely the distribution of 
items (by content area and type of objective) agrees with the proportional 
emphasis desired. Check the test itself to determine what percentage of 
items appears valid for local use, to ascertain whether items seem to be 
well-constructed, and to determine the percent of test items that test un- 
derstanding, rather than just requiring the students to recognize memorized 
content. 
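
The comparison of item distribution with desired emphasis can be reduced to simple arithmetic. The sketch below is illustrative only: the function name, the three content areas, the 40-item test, and the desired percentages are all hypothetical, and in practice the classification of items would come from the test manual or from the teacher's own judgment.

```python
from collections import Counter

def emphasis_gaps(item_areas, desired):
    """Return {area: actual % of items minus desired %} for each content area."""
    counts = Counter(item_areas)
    total = len(item_areas)
    areas = set(counts) | set(desired)
    return {a: 100 * counts.get(a, 0) / total - desired.get(a, 0) for a in areas}

# Hypothetical 40-item arithmetic test, each item classified by content area.
items = ["computation"] * 24 + ["problem solving"] * 10 + ["concepts"] * 6

# Hypothetical proportional emphasis (in percent) desired in the local curriculum.
desired = {"computation": 40, "problem solving": 40, "concepts": 20}

gaps = emphasis_gaps(items, desired)
for area, gap in sorted(gaps.items()):
    print(f"{area:16s} {gap:+.0f} percentage points")
```

In this made-up case the test gives 20 percentage points more weight to computation than the teacher wants, at the expense of problem solving and concepts; whether such a gap is tolerable remains a matter of judgment.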

In the selection of achievement tests, face validity is important, that 
is, the extent to which a test appears to measure relevant information and 
abilities. Especially when an achievement test is used as part of a final 
examination, it is essential that the students feel that the test is fair in 
that it emphasizes what they have studied. A teacher can judge face valid-
ity, and also the appropriateness of the difficulty level of the test, only by
an actual examination of test items. In fact, it is a good idea for the teacher
to take the test.


Concurrent and Predictive Validity 


Validation data of these two types are especially important when tests 
are to be used in making decisions about students or in improving the 
informational basis for decisions by students. We are especially concerned 
with these types of validity when decisions among alternatives are being 
made. 

Examples of institutional decisions about students include the selection 
of students for college admission or other programs admitting limited num- 
bers and the placement of students in ability groups. For such uses, local 
validation data and expectancy tables are indispensable. However, one can
select tests that seem promising for local use by searching the test manual 
(or a technical supplement) for correlations between test scores and the 
criterion data in which one is interested, obtained on groups of students 
similar to local groups. 
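
A local expectancy table of the kind mentioned above might be tabulated as in the following sketch. Everything in it is assumed for illustration: the function name, the score bands, the twelve student records, and the definition of "success" as a course grade of C or better.

```python
from collections import defaultdict

def expectancy_table(records, bands, success):
    """Percent of students in each test-score band who met the criterion.

    records: (test_score, criterion) pairs from a local validation group
    bands:   (low, high) inclusive test-score intervals
    success: function deciding whether a criterion value counts as success
    """
    tally = defaultdict(lambda: [0, 0])  # band -> [number successful, number in band]
    for score, criterion in records:
        for low, high in bands:
            if low <= score <= high:
                tally[(low, high)][1] += 1
                tally[(low, high)][0] += int(success(criterion))
                break
    return {band: round(100 * s / n) for band, (s, n) in tally.items()}

# Hypothetical aptitude scores paired with later course grades.
records = [(12, "D"), (15, "C"), (18, "F"), (17, "D"),
           (25, "C"), (30, "B"), (35, "D"), (28, "C"),
           (45, "A"), (50, "B"), (55, "C"), (42, "B")]
bands = [(0, 19), (20, 39), (40, 60)]

table = expectancy_table(records, bands, success=lambda grade: grade in "ABC")
for (low, high), pct in sorted(table.items()):
    print(f"scores {low:2d}-{high:2d}: {pct:3d}% earned C or better")
```

With so few cases per band the percentages would be very unstable; an actual local study would need a far larger validation group before such a table could guide selection or placement.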

A test will not have high concurrent or predictive validity for a specific
purpose unless the difficulty level is appropriate. If one's chief purpose is
to differentiate among low-achieving students so as to select those who
should be assigned to remedial classes, it is best to select a test “peaked”
at a low level of difficulty, that is, with a large number of items that would
constitute a good test for that group. In a test with insufficient “test floor,”
many students who would obtain scores scattered over a considerable
range on an easier test will obtain zero or near-zero scores. Such a test is
of no value in differentiating among low-achieving students; we can have 
little confidence in predictions based on their scores. 

On the other hand, if one's chief interest is in selecting students for an 
accelerated class, the test should be “peaked” at a high level of difficulty, 
containing a large percentage of difficult items. On a test with only a few 
difficult items, or too low a "test ceiling," high-achieving students (whose 
scores would scatter over a considerable range on a more difficult test)
pile up on the high end of the scale, as in the arithmetic test (Figure 2.4). 
Such a test is of little or no value in differentiating among high-achieving 
students. 

If we want to do a good job of measuring individual differences 
throughout a wide range of ability, we must use a fairly long test with a 
wide spread in item difficulty; or we can divide our group of examinees, with 
one group taking a higher-level test and another group a lower-level test.⁹ 
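
The effect of an inadequate test floor can be illustrated numerically. The sketch below is not from the text: it assumes a simple logistic model for the chance of passing an item, ignores guessing, and uses hypothetical ability and difficulty values on an arbitrary scale.

```python
import math

def expected_score(ability, difficulties):
    """Expected number of items passed under an assumed logistic item model.

    The model and its scale are illustrative assumptions only.
    """
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties)

def score_range(abilities, difficulties):
    """Spread (highest minus lowest) of expected scores within a group."""
    scores = [expected_score(a, difficulties) for a in abilities]
    return max(scores) - min(scores)

# Hypothetical low-achieving group, spaced along the ability scale
low_group = [-3.0, -2.5, -2.0, -1.5, -1.0]

easy_test = [-2.0] * 40  # 40 items "peaked" at a low level of difficulty
hard_test = [2.0] * 40   # 40 items "peaked" at a high level of difficulty

easy_spread = score_range(low_group, easy_test)
hard_spread = score_range(low_group, hard_test)

print(f"spread on easy test: {easy_spread:.1f} items")
print(f"spread on hard test: {hard_spread:.1f} items")
```

Under these assumptions the same five students are spread over many more raw-score points on the easy test than on the hard one, where their scores crowd near zero.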


Other Empirical Evidence of Validity 


In this section of the outline, one would summarize other validation 
studies that help the user to interpret the meaning or significance of test 
scores. Construct validity studies help in identifying the many factors that 
contribute to individual differences in test scores. Such studies should help 
one to determine whether test scores are influenced unduly by student dif- 
ferences in vocabulary level, speed of working, and other factors unrelated 
to the trait measured. In interpreting test scores, it is helpful to have research 
findings on whether or not significant differences are found between aver- 
age scores made by boys and girls, by students from differing socioeco- 
nomic backgrounds, or by students who have or have not had specialized 
instruction. Answers to many such questions aid the test user in 
interpreting what the test scores mean. 

The type of validation studies needed varies with the type of test. For 
example, for personality and interest inventories, it is valuable to have 
research findings on the effectiveness of the test in predicting relevant 
behavior outside the test situation; for achievement tests, one seeks data on 
the extent to which scores obtained by multiple-choice questions agree 
with results obtained by free-response tests of equivalent content. For any 
tests that are to be used in making inferences regarding intraindividual 
differences, data regarding the homogeneity of subtests and 
intercorrelations between subtests should be examined. 

The topic of construct validity is so complex that the student is referred 
to the last section of Chapter 4 for a review of all the types of evidence 
that would appropriately be included in this section of the outline. 

9 Because test purchasers have wanted to simplify testing programs and get 
"maximum returns" for their testing money and testing time, they have encouraged 
publishers of achievement tests to spread their items very thinly over a wide range 
of both content and difficulty. As a result, we may have so few items appropriate 
for the lowest grade level of a multigrade range that a chance score (obtained by 
marking responses at random) can bring students almost up to grade level. Greater 
awareness of this problem, and the use of two or more tests (of different ranges 
in difficulty) seem essential if meaningful scores are to be obtained for all students 
in the usual heterogeneous groups included in city-wide testing programs. The 
practice used in the STEP tests, of planning directions and time limits so that tests 
of two or more levels can be administered at the same time, constitutes a useful 
precedent. This problem is discussed further in Chapter 13. 

In evaluating evidence on concurrent, predictive, and construct validity, 
one must examine tables of validity coefficients for each test, select those 
coefficients that are most relevant to the intended use, and then compare 
them. Making sound inferences from comparisons of validity coefficients, 
however, is not an easy undertaking. In general, the higher the validity 
coefficient, the better; however, many factors have to be taken into account 
in making comparisons: 


1. Criteria differ 

For example, when the scores of college freshmen on the Davis Reading 
Test were correlated with grades in English, the average r was approximately 
.50; when scores on this same test were correlated with the STEP Reading 
Test, the correlations were .76 for the reading level scores and .81 for the 
speed scores.¹⁰ The lower correlation with teachers' marks is due, in part, to 
the lower reliability of the criterion and, in part, to the fact that many 
factors other than reading ability affect grades in English. Some tests lend 
themselves to easy validation; for others, such as the personality inventories, 
adequate criteria are difficult to find. For some tests the only feasible criteria 
are ratings (which are admittedly subjective and unreliable). One cannot 
expect high validity coefficients for this type of test. 


2. Groups differ 

The groups on which the correlations are computed differ. Correlation 
coefficients tend to be lower for more homogeneous groups.¹¹ 


3. When we attempt to interpret validity coefficients in terms of their value for 
us (that is, in making the judgments we wish to make), we must ask ourselves how 
much additional information the test gives. For example, if we are trying to 
predict success in algebra, and we already have the results of a fairly recent 
intelligence test on the cumulative record, we may be more impressed with 
validity coefficients of .40 to .50 between an arithmetic test and algebra 
grades than with validity coefficients for another test of general scholastic 
aptitude, which fall in the .50 to .60 range. The reason is that the arithmetic 
test provides new information. 

10 Frederick B. Davis and Charlotte C. Davis, Manual, Davis Reading Test (New 
York: The Psychological Corporation, 1962), pp. 22, 26. 

11 For a discussion of the relationship between the heterogeneity of groups and 
the relative size of correlation coefficients, the reader is referred to Chapter 3, 
page 94. 
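
The point about new information can be made concrete with the standard formula for the multiple correlation of a criterion with two predictors. The Python sketch below is illustrative: the validity coefficients are chosen to fall in the ranges mentioned above, and the correlations between predictors (.85 and .50) are assumptions that do not come from the text.

```python
import math

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: validity coefficients of predictors 1 and 2
    r_12: correlation between the two predictors
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)

# Hypothetical validity of the intelligence test already on the record
r_iq = 0.55

# Option A: a second scholastic aptitude test with validity .55 that
# overlaps heavily with the intelligence test (assumed r = .85)
r_with_aptitude = multiple_r(r_iq, 0.55, 0.85)

# Option B: an arithmetic test with lower validity (.45) that overlaps
# less with the intelligence test (assumed r = .50)
r_with_arithmetic = multiple_r(r_iq, 0.45, 0.50)

print(f"intelligence test alone:      {r_iq:.3f}")
print(f"plus second aptitude test:    {r_with_aptitude:.3f}")
print(f"plus arithmetic test:         {r_with_arithmetic:.3f}")
```

Under these assumed intercorrelations the arithmetic test raises the combined prediction slightly more than the second aptitude test does, even though its own validity coefficient is lower.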


Reliability 


In evaluating a test, our requirements with respect to reliability depend 
on the type of comparisons we wish to make. Comparisons between 
groups make the lowest demands on reliability; comparisons between 
individuals require greater reliability. If we wish to measure gains for 
individuals, we are dealing with difference scores that include the error 
variance of both tests; hence for such comparisons, reliability needs to 
be very high. 

When we wish to study intraindividual differences in diagnosis or 
guidance, high reliability of subtests and fairly low correlations between 
subtests are essential. If test scores are to be used as a basis for making 
profiles and studying intraindividual differences, the manual should 
provide information concerning the reliability of differences. Profiles should 
be designed so as to minimize the risk of the user's attaching significance 
to small differences that might be reversed in direction on retesting. An 
example is the profile form for the Differential Aptitude Tests (Figure 
6.2), on which a one-inch difference in the height of two bars may be 
interpreted as representing a reliable difference. Another technique is used 
in the STEP profile (Figure 5.2). 
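
Why difference scores demand high reliability and low intercorrelation can be seen from the usual formula for the reliability of a difference between two scores. The sketch below assumes the two scores are expressed on scales with equal standard deviations; the coefficients used are hypothetical.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference between two scores.

    Assumes both scores have equal standard deviations.
    r_xx, r_yy: reliabilities of the two tests (or subtests)
    r_xy: correlation between the two tests
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Two subtests, each with reliability .90 (hypothetical values)
highly_correlated = difference_reliability(0.90, 0.90, 0.80)
weakly_correlated = difference_reliability(0.90, 0.90, 0.40)

print(f"subtests correlating .80: {highly_correlated:.2f}")
print(f"subtests correlating .40: {weakly_correlated:.2f}")
```

With these values, two subtests that each have a reliability of .90 yield a difference score with a reliability of only about .50 when they correlate .80, but about .83 when they correlate .40.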

Another factor that affects the degree of reliability required is the 
finality of the decision and the use of other data in decision-making. If 
we are excluding students or applicants on the basis of a single test, 
reliability of scores should be very high; if we are using the test as a 
preliminary screening device and will examine other test and non-test data 
for those students whose scores do not give a clear-cut prediction, lower 
reliability is acceptable. 

If we are making decisions about people as in college admission, ability 
grouping, and the like, minimum reliability requirements should be higher 
than if we were selecting a test to be used by people in making their own 
decisions. When a person is making his own choices, we can remind him 
of the standard error of measurement, and we can encourage him to take 
other data into account. Under such circumstances, we can accept tests with 
somewhat lower reliability than would be usable for institutional decisions. 
The student is referred to Chapter 3 for further suggestions regarding the 
interpretation of reliability coefficients. 
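
The standard error of measurement mentioned above follows directly from the standard deviation of the scores and the reliability coefficient. The sketch below uses the usual formula; the score scale (SD of 100), the two reliabilities, and the obtained score of 520 are illustrative values, not data from any manual.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(obtained_score, sd, reliability):
    """Band of one standard error on either side of an obtained score."""
    sem = standard_error_of_measurement(sd, reliability)
    return obtained_score - sem, obtained_score + sem

# Illustrative scale with a standard deviation of 100
sem_high = standard_error_of_measurement(100, 0.91)
sem_low = standard_error_of_measurement(100, 0.75)

low, high = score_band(520, 100, 0.91)
print(f"SEM at r = .91: {sem_high:.0f} points")
print(f"SEM at r = .75: {sem_low:.0f} points")
print(f"band for a score of 520 at r = .91: {low:.0f} to {high:.0f}")
```

A person using a test with a reliability of .75 should therefore be reminded that his score could easily shift by half a standard deviation on retesting.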


[Fig. 5.2. Student profile for the Sequential Tests of Educational Progress 
(STEP), on which each score is plotted as a percentile band. According to the 
directions printed on the form, overlapping bands for two tests indicate no 
important difference between the student's standings on those tests; bands 
that do not overlap indicate that the standing shown by the higher band is 
really better. Figure not reproduced.] 


With respect to methods used in estimating reliability, it is ideal if two 
or more different methods have been used. By comparing the results 
obtained by different methods, we can estimate how much of the variance 
is due to the specificity of forms, how much to temporal variations in 
general characteristics of individuals, and how much to temporal variations 
in reactions to the specific test.¹² 
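
One way such a comparison might be carried out is sketched below. The decomposition is a simplified reading of how the three coefficients relate to the sources of variance, assumed here for illustration; it is not necessarily the exact procedure of the source cited in the footnote, and the three coefficients are hypothetical.

```python
def variance_components(equivalence, stability, equivalence_and_stability):
    """Rough partition of score variance from three reliability coefficients.

    Assumed interpretation (a simplification, not taken from the text):
      equivalence               = alternate forms given together
      stability                 = same form repeated after an interval
      equivalence_and_stability = alternate forms given after an interval
    """
    lasting_general = equivalence_and_stability
    form_specific = stability - equivalence_and_stability
    temporal_general = equivalence - equivalence_and_stability
    residual = 1.0 - lasting_general - form_specific - temporal_general
    return {
        "lasting general characteristics": lasting_general,
        "specificity of forms": form_specific,
        "temporal variation in general characteristics": temporal_general,
        "temporal reactions to the specific test and other error": residual,
    }

# Hypothetical coefficients reported in a manual
components = variance_components(
    equivalence=0.90, stability=0.85, equivalence_and_stability=0.80
)
for source, share in components.items():
    print(f"{source}: {share:.2f}")
```

The four shares sum to 1.00 by construction, so a large gap between any two coefficients points to the source of variance that the lower coefficient treats as error.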

If alternate forms of a test are to be used to measure growth, a co- 
efficient of equivalence (obtained by the alternate-forms method) is essen- 
tial; an internal consistency coefficient is not an adequate substitute. 
Internal consistency methods must not be used for tests in which speed 
of working is an important factor in test scores. 


Norms 


Many standardized tests provide more than one type of converted score. 
For example, many achievement test batteries provide grade placement 
norms so that school averages can be compared with national norms in 
each subject area; and in addition, they provide percentile norms so that 
results for individuals can be interpreted in terms which students and 
parents can easily understand (provided that confusion with "percentage 
correct" scores is avoided).¹³ 

Normalized standard scores are becoming more widely used. Standard 
score units represent equal differences in raw scores throughout the score 
range. In other words, if three students each gained ten T-score points 
during the year, these gains would represent approximately the same gains 
in raw score, regardless of whether the student was initially in a 
low-scoring, average-scoring, or high-scoring group. This statement would not 
be true for PR's; in fact PR's are unsatisfactory as a measure of growth. 
Stanine scores combine the advantages of equality of units and ease of 
interpretation to lay persons. 
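
The contrast between T-scores and percentile ranks as measures of growth can be checked with a short calculation. The sketch below assumes normally distributed T-scores (mean 50, standard deviation 10) and follows three hypothetical students who each gain ten T-score points.

```python
from statistics import NormalDist

# T-score scale: mean 50, SD 10, assumed normally distributed
T_SCALE = NormalDist(mu=50, sigma=10)

def percentile_rank(t_score):
    """Percentile rank corresponding to a T-score on a normal scale."""
    return 100 * T_SCALE.cdf(t_score)

# Initial T-scores of three hypothetical students: low, average, high
pr_gains = {
    start: percentile_rank(start + 10) - percentile_rank(start)
    for start in (30, 50, 70)
}

for start, gain in pr_gains.items():
    print(f"T {start} -> {start + 10}: gain of {gain:.1f} percentile points")
```

The same ten-point T-score gain amounts to roughly 14 percentile points for the low-scoring student, 34 for the average student, and only 2 for the high-scoring student, which is why percentile ranks distort comparisons of growth.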


If a test provides only one type of converted score, one must make sure 
that this type is adequate for the intended use. As we have mentioned, if 
a test has only percentile norms, interpretation of gains during a school 
year is difficult. A test that provides only grade placement norms may 
give a misleading impression of the level of competency, and the relative 
competencies in subtests, for many students. However, if the test is 
satisfactory in other respects but lacks the type of norms desired, the 
development of local percentile and/or standard score norms might be advisable. 

12 Cronbach presents an illustrative problem in which the four types of variance 
are estimated from data concerning coefficients of "equivalence," "stability," and 
"equivalence and stability." Lee J. Cronbach, Essentials of Psychological Testing 
(New York: Harper & Row, Publishers, Inc., 1960), pp. [...]. 

13 Ideally, percentile norms for school averages should also be provided. School 
averages do not vary as widely as scores for individuals; for example, if the average 
reading score for a school corresponded to a PR of [...] on the [...], 
that school would probably be achieving at a higher average level than [...] 
percent of the schools included in the norming sample. 

In appraising the adequacy of the norming samples used, one should 
consider not only number of cases but the procedures used to ensure that 
the norming sample is representative of the defined population. In norm- 
ing achievement tests, the number of different communities used and their 
representativeness by geographic region, socioeconomic class, and other 
factors are important. 

For some uses, the recency of norm revision is very important; for 
others it is less crucial. When one wishes to use "national norms" as a 
basis for evaluating achievement, a revision within the last eight to ten 
years is advisable, since publishers have used much more adequate 
procedures within the past decade for obtaining representative samplings of 
the general population. 


At least three 


Practical Considerations with Respect to Administration and Use 


In addition to the major criteria discussed above [...] of minor criteria 
that should be considered [...] choice among two or three instruments [...] 
reliable can be made on the basis [...] cost, their mechanical make-up, [...] 
the ease of administration and scoring. 


A number of factors affect [...]ty of instructions to examiner and sub[...] 

Clear directions are indispensable [...] examiner and the directions to pupils 
are clear, the test is not sufficiently well standardized. Under these 
circumstances, one cannot be sure that he is giving the test under the same 
conditions [...] type of item is unfamiliar or if the directions are complex, 
more than one example should be given. 

Even before one orders specimen sets, one can usually determine from 
the publisher's catalog whether special qualifications are required for 
administering a test and/or interpreting the results. Several test publishers 
are now classifying their tests according to levels, as recommended by the 
American Psychological Association. These levels are: 


LEVEL A Tests or aids that can adequately be administered, scored, and 
interpreted with the aid of the manual . . . (for example, achievement or pro- 
ficiency tests). 


LEVEL B Tests or aids that require some technical knowledge of test 
construction and use, and of supporting psychological and educational subjects 
such as statistics, individual differences, and psychology of adjustment, 
personnel psychology, and guidance (for example, aptitude tests, adjustment 
inventories with normal populations). 


LEVEL C Tests and aids that require substantial understanding of testing 
and supporting psychological subjects, together with supervised experience in 
the use of such devices (for example, projective tests, individual mental tests).¹⁴ 


EASE AND OBJECTIVITY OF SCORING If the burden of scoring and 
computing converted scores is too great, the teacher has relatively less time 
for interpreting test results. For this reason, publishers are making every 
effort to reduce the scoring burden. 

The use of test-scoring equipment, now available in many city and 
county school systems, requires special answer sheets, which are not 
ordinarily used below the fourth grade. Two leading types of scoring services 
are available: (1) the IBM Test Scoring Machines, which can be purchased 
or leased by school districts and (2) central scoring services that 
utilize special high-speed equipment, obtaining and [...] 
through a single "reading" (photoelectric scanning) of both sides of an 
answer sheet (which can accommodate almost [...] items). 

One of the chief advantages of IBM scoring is that school districts can 
lease their own equipment, which can then be available for the scoring of 
teacher-made tests. One of the chief disadvantages is that answer sheets 
must be inspected to make sure that answer spaces have been fully 
blackened in and that all stray pencil marks and smudges have been 
removed. When sheets are well marked, 500 tests an hour can be scored 
with IBM machines. 


14 "APA Code of Standards for Test Distribution," American Psychologist, vol. 
5 (November 1950), pp. 620-626. 




The photoelectric type of service is especially economical for achieve- 
ment test batteries with several scores. Such a machine can score approx- 
imately 6000 answer sheets per hour, printing both raw and converted scores 
on a summary sheet. Answer sheets must be mailed to scoring centers. 

An example of centralized photoelectric scoring is the National Guid- 
ance Testing Program of the Educational Testing Service, available for 
grades 4—14. A school district can obtain from one to nine scores for each 
student. The tests are scored and the program results are listed for each 
group. The basic service (scoring answer sheets, preparing list reports, 
permanent record slips, and individual report forms) is provided for a 
small per-pupil fee, regardless of the number of tests the student takes 
from the SCAT and STEP series. This example is given to illustrate how 
modern scoring and computing services make it possible to get many scores 
as cheaply as one or two. All leading test publishers now offer a similar 
type of scoring service for the tests they publish. Some publishers also 
offer hand-scoring services for tests given to pupils in the primary grades. 

Some state universities, and some state and county departments of 
education, offer low-cost scoring services to school districts. A school that 
intends to do its own hand or machine scoring can estimate how 
time-consuming the scoring process is for each of several specific tests by 
comparing the fees charged by some nonprofit testing service, such as a state 
university that maintains scoring services for a wide variety of tests.¹⁵ 

Teachers have criticized machine-scoring methods because they do not 
indicate which of the students' answers are incorrect. Carbon-backed 
answer sheets have been developed to meet this objection. With one of 
these techniques, the Scoreze, developed by the California Test Bureau, 
the student marks his responses on a standard machine-scoring answer 
sheet, which can be detached and scored by machine if such equipment is 
available.¹⁶ 


MECHANICAL MAKE-UP The proper mechanical make-up of a test may 
be very important in its indirect effect on the validity of students' scores. 
The format should be attractive, and the size of type appropriate to the 
grade level. The quality of pictures and diagrams is very important. 

15 An example of such a publication is Unit on Evaluation, Test Scoring Service 
and Rental Service (Champaign, Ill.: University of Illinois, 1955). 

16 A somewhat different process has been developed by Harcourt, Brace & World, 
Inc. Students mark LDP answer sheets; the letters stand for the "Liquid 
Duplicating Process." The answer sheets are scored by overprinting the correct 
responses with a liquid duplicator. For some of the subtests, the master stencil 
overprints, at the same time, information pertaining to the skill or knowledge 
being measured by each test item. To date, this process is used only with the 
Stanford Achievement Tests, Elementary, Intermediate, and Advanced batteries. 


NUMBER OF EQUIVALENT FORMS If the teacher wishes to measure 
growth in achievement by pre- and post-testing, it is necessary to administer 
two forms of a test, designed to be parallel in content and equal in dif- 
ficulty. If a school staff plans to test intelligence or achievement at yearly 
intervals and compare the results for individuals and groups, it is important 
to select tests that have at least two equivalent forms. Several standard- 
ized tests have three to five such forms, equivalent in difficulty and designed 
to represent closely parallel samplings of skills and understandings. 


AIDS TO INTERPRETATION Many publishers have developed excellent 
materials to aid teachers and counselors in the interpretation of test scores. 
The test authors are specialists in their own test and should be able to 
offer good advice concerning the ways in which the scores can best be 
interpreted and used. 

Manuals for the Sequential Tests of Educational Progress (STEP) and 
for the Evaluation and Adjustment Series of high school tests¹⁷ provide 
item norms so that teachers can compare the achievement of their classes 
on each item with that of the norming sample. The STEP Teacher's Guide 
also provides suggestions for the discussion of test results with students. 
The manual for the Metropolitan Achievement Test is rich in sound 
suggestions for the use of results from this battery. The California 
Achievement Tests provide diagnostic analyses that assist teachers in obtaining 
diagnostic clues from student performance on small groups of similar 
items. These leads, however, should be checked with other sources of 
information, such as teacher-made tests or analysis of the student's work 
in the subject area. 

Publishers of aptitude, interest, and other predictor tests have developed 
expectancy tables, multiple-group norms, student [...], [...] 
to interpretation, and the like. For example, the publishers of the Kuder 
Preference Record (an interest inventory) have developed excellent [...] 
lists on the different occupational fields, filmstrips and booklets to aid in 
the interpretation of results, and an "Occupational Counseling Review Set" 
for the counselor. 

All such aids need to be evaluated, not just in terms of their number 
and attractiveness but in terms of whether they [...] 
decision-making. Here, the opinions of test specialists, considered in the 
last section of this chapter, are of great value. 


COST OF TEST SUPPLIES As a principal contemplates his dwindling 
supply budget, the relative cost of different tests may be an important 
factor in selection. However, one test may be costly [...] per 
copy, whereas another one is cheap at fifteen cents. To waste teacher and 
student time on the administration and scoring of a cheap but inadequate 
test is poor economy. Since adequate standardization of tests is very 
expensive, the cost of tests involves more than the cost of paper and 
printing. Moreover, a more expensive test may include devices that reduce 
scoring time. The cost of scoring, which has already been discussed, may 
be a more significant factor than cost of materials. 

17 These series are discussed in Chapter 13. 

For students in the upper elementary and the secondary grades, economy 
may often be effected by using separate answer sheets obtained from the 
publisher and reusing the more expensive test booklets again and again. 


Many publishers permit schools to lease tests; such an arrangement may 
provide an economical means of trying out a new test or of obtaining tests 
to be used in a special research study. 


ILLUSTRATIVE USE OF THE SUMMARY FORM 
WITH A STANDARDIZED TEST 


Before the student uses the suggested summary form in compiling and 
interpreting data regarding tests of his own choice, he should study the 
example of a completed form provided in Table 5.2. In this example, 
only the major headings of the outline have been used. 


Table 5.2 
Summary of Data Needed in the Appraisal of a Standardized Test 
(Form filled in for the Scholastic Aptitude Test) 


REFERENCE DATA 
1. Title: Scholastic Aptitude Test 
2. Names of major subtests: Verbal skills, mathematical skills 
3. Author(s): Staff, Educational Testing Service, with the advice of a committee 
of examiners in aptitude testing 
4. Publisher: Educational Testing Service 
5. Range in grades: Applicants for admission to college 
6. No. of forms: New form developed each year 
7. Purposes for which test is recommended by author(s): 

"The specific job for which the Scholastic Aptitude Test [...] provide an 
indication of a student's ability to do college work. (It was not, and is 
not, expected to take over the whole job [...] 

More precisely, the test is a measure of the [...] 
and mathematical skills that are necessary to perform the academic tasks 
required in college."a 




CONTENT VALIDITY 

This test is not intended to sample all aspects of intelligence, or even of scholastic 
potential. Many studies have been made over the years to improve the balance of 
content so that the test is not biased in favor of men or women, or in favor of 
students majoring in either the humanities or the mathematics-science curricula. 


CONCURRENT AND PREDICTIVE VALIDITY 

The criterion for validating the SAT tests has usually been grade-point average 
during the freshman year of college. However, in some studies, the four-year college 
average, graduation vs. nongraduation from college, and grades in specific academic 
subjects have been used. 

Studies of the relationship between SAT scores and college grades reveal a sub- 
stantial relationship between SAT scores and subsequent academic performance in 
many different types of colleges, with verbal scores providing the better prediction 
in some curricula and mathematical scores in others. Separate prediction studies 
have been made for different types of colleges and different types of college 
curricula. Validity coefficients as high as .60 with freshman grade-point average 
are obtained under favorable conditions (that is, when college students follow a 
relatively uniform academic program, when college grades are based on extensive 
information about student performance, when grading standards are fairly consistent 
from one instructor to another, and when almost all students are working at a 
relatively high level of motivation). 


Predictive Validity at Different Score Levels 
Although research data seem to indicate equally good predictive validity for 
low-scoring, average-scoring, and high-scoring students, a special study was 
made of whether a high-level form of SAT was needed for the most able students. 
The high-level test devised did not show sufficiently improved validity to 
justify its use; but research on the best way to supplement the SAT with test 
data that will help highly selective colleges in selecting students from among 
high-ability applicants is continuing. 

Predicting SAT Scores from SCAT Scores 


Expectancy tables are available for predicting SAT scores from scores on the high 
school edition (SCAT), given in the 8th, 9th, 10th, and 11th grades. 


OTHER EMPIRICAL EVIDENCE OF VALIDITY 
Effect of Coaching on Test Scores 

An attempt has been made in successive revisions of the SAT to make the test 
as impervious to coaching as possible. Seven research studies (four made by the 
Educational Testing Service and three by independent investigators) agree that the 
average gain as the result of special coaching is less than 10 points, or 1/10 of a 
standard deviation. 

Effect of Fatigue on Test Scores 

Research studies have revealed no evidence that students perform less well on 
sections taken toward the end of a day of testing. 



Effect of Anxiety on Test Scores 


A half-hour version of the SAT was administered to 2000 students under the usual 
“anxious conditions” as part of the regular administration of tests for the College 
Entrance Examination Board. This same condensed version was administered to 
the same students under “relaxed conditions,” the students being told that the study 
was being made for research purposes and scores would not be reported to the 
colleges. They were urged to do their best, however, since scores would be reported 
to their own high schools. Test anxiety seemed to have no effect on boys' scores 
but seemed to slightly increase girls' scores on the mathematics section. There was 
no difference in the concurrent validity of the tests, given under "anxious" as 
compared to "relaxed" conditions, that is, in the correlation of SAT with high school 
grades. 


Fairness to Students from Different Cultural Backgrounds 


Since the purpose of SAT is to predict the students’ ability to do academic work 
in college, the question investigated was: "Do students from different or 
underprivileged backgrounds do better in college than one would expect from their SAT 
scores?" or "Is the SAT really a fair measure of their ability to handle college work?" 

Research studies showed that, despite marked differences in background among 
students taking the test, there was no general tendency for students from different 
socioeconomic backgrounds to do any better or worse in college than the test scores 
predicted. Hence the College Entrance Examination Board concluded that “any 
cultural unfairness to students from less favored backgrounds who seek admission 
to college lies less in the Scholastic Aptitude Test than in the educational and en- 
vironmental inequalities of our society."b 


RELIABILITY 


Reliability coefficients have been computed by the Kuder-Richardson [...] 
representative samples of students [...] recent three-year summary ranged 
from .88 to .91 for both the verbal and mathematical tests. The standard error 
of a SAT score is approximately 30 points or 3/10 of a standard deviation. 
[...] on the score scale, rather than a precise measurement, is emphasized in 
all interpretative materials sent to students, high schools, and colleges. 


College Board scores are reported on a scale based on a linear transformation of 
raw scores, with a mean of 500 and an SD of 100. The original scale was based on 
students' scores in the April 1941 test administration. As subsequent forms have 
been developed, each new form has been equated to the April 1941 form through 
an equating study, in which a sampling of new candidates answers a group of 
questions from previous forms so that adjustment can be made for any changes in 
the abilities of the groups from year to year. 

Norms are available for many subgroups. For example, percentile ranks are 
available for all high school juniors, all high school seniors, high school students 
who choose to take SAT, all high school seniors going to college, and students 
enrolled at various types of colleges. 

The College Entrance Examination Board provides distributions of SAT scores 
for admitted freshman students for all colleges with a large number of candidates. 
Results for these various norm groups are of considerable value to high school 
counselors in interpreting students' scores in comparison with their prospective peers 
in the different colleges they are considering. For example, a boy with a SAT verbal 
score of 500 would stand at the 85th percentile among high school seniors and at 
the 64th percentile among all students who enter college. In one of the moderately 
selective colleges, this student would rank above 47 percent of entering freshmen, 
but in another highly selective college, his score would exceed only 2 percent of 
entering freshmen.c The sampling error involved in a single test of ability should 
be considered in making such interpretations. 


PRACTICAL CONSIDERATIONS WITH RESPECT TO ADMINISTRATION AND USE 
Time requirements: Three hours
Scoring: Machine-scored by central scoring service
Aids to Interpretation:
An orientation booklet for students, entitled A Description of the College Board
Scholastic Aptitude Test, including many sample questions
An interpretive leaflet to aid counselors in interpreting scores to students and
parents, entitled Your College Board Scores: Scholastic Aptitude Tests, Achievement
Tests
College Board Score Reports: A Guide for Counselors
College Board Score Reports: A Guide for Admissions

Cost of testing: Paid by the student seeking college admission.


Source: Data excerpted, with the permission of the publisher, from "The Scholastic
Aptitude Test," Annual Report, 1961-1962 (Princeton, N. J.: Educational Testing
Service, 1962), pp. 11-46; The Scholastic Aptitude Test, 1926-1962, Test
Development Report TDR-63-2 (Princeton, N. J.: Educational Testing Service, 1963).


a "The Scholastic Aptitude Test," p. 11.

b Ibid., p. 26.

c As [...] in College Board Score Reports: A Guide for [...] (Princeton, N. J.:
College Entrance Examination Board, Educational Testing Service, 1962), p. 31.




The authors chose to summarize data regarding the SAT (Scholastic
Aptitude Test), developed and administered by the College Entrance
Examination Board. This choice was made because (1) the Board has
recently issued a summary of its research on this instrument, which
included a variety of validation studies; (2) the summarization of these
data would be interesting to college students reading this textbook, because
most of them have taken the test; and (3) the choice of this test avoided the
disagreeable alternative of reporting inadequacies in some commercially
published test, which might be corrected within the next year or two.
Under such circumstances, students would continue to study an erroneous
test review, outdated by the revision of the test. Since many leading tests
are in the process of revision, we would prefer to leave the critical review
of published tests to the student, with the aid of his professor, the Mental
Measurements Yearbooks, and other sources mentioned in the next section
of this chapter.


SOURCES OF INFORMATION ABOUT PUBLISHED TESTS 


In this textbook, we have deliberately minimized the discussion of specific 
tests. In Parts Two and Three, a few of the major tests will be dis- 
cussed; but even here, little space will be given to sample profiles or to 
illustrative items from published tests. The authors believe that students 
should obtain and study actual tests and manuals. To assist them in finding
tests of interest to them, a very comprehensive classified list of tests is
presented in the Appendix.


One desirable outcome of a course in tests and measurements is the 
student's realization that he needs to consult expert opinion about tests
in order to supplement his own indispensable study of a test and its manual. 
The student will find that his single best source for critical reviews of tests 
is the series of Mental Measurements Yearbooks, edited by Buros. The 
four most recent yearbooks, from 1940 through 1959, are listed in the 
Selected References at the end of the chapter. 

When the student locates in the Appendix test entries in which he is 
interested, he will note in the right-hand column notations concerning all 
the test reviews that have appeared in the Buros Yearbooks. Reviews of a 
specific test may have appeared in more than one yearbook. Ordinarily, 
a new test is reviewed in the next yearbook that appears after its publication.
However, Buros was not able to include in the early yearbooks
(1938 and 1940) all of the tests then available. Hence, in succeeding 
yearbooks, he has included tests not previously reviewed, as well as addi- 
tional reviews of widely used tests. As a result, the student can usually 
find two or more reviews of any test in which he is interested. In addition,
he may find in the yearbooks excerpts from reviews published in professional
journals, as well as references to investigations in which the test has been
studied.

The latest Mental Measurements Yearbook (1959) included only tests 
published through 1958; and for some of these, only bibliographical data 
could be included. For references to research studies on recently pub- 
lished tests, the reader should check appropriate chapters of the Annual 
Review of Psychology and those issues of the Review of Educational Research
on psychological testing. The most recent issues on psychological
testing were published in February 1959 and February 1962; further issues 
will appear at three-year intervals. The articles in the Review constitute 
brief, comprehensive surveys of research studies in each of the various 
fields of measurement during the preceding three-year period. 

Since 1959, a test review section has been included in the Personnel 
and Guidance Journal; since 1954, Personnel Psychology has included a 
section called “Validity Information Exchange,” which reports new data 
on the validity of tests used by personnel workers; in 1956 a similar 
information exchange on normative data was added. Validity studies are 
also summarized in two issues per year of Educational and Psychological
Measurement.

A few books on measurement and evaluation in guidance have been
published. Among the most recent are:


Froehlich, Clifford P., and Kenneth B. Hoyt, Guidance Testing, 3d ed. (Chi- 
cago: Science Research Associates, Inc., 1959). 

Goldman, Leo, Using Tests in Counseling (New York: Appleton-Century-Crofts, 
1961). 

Rothney, John W. M., and others, Measurement for Guidance (New York:
Harper & Row, Publishers, Inc., 1959).

Super, Donald E., and John O. Crites, Appraising Vocational Fitness, 2d ed.
(New York: Harper & Row, Publishers, Inc., 1962).


Additional books of value in the more specialized fields of aptitude, interest,
attitude, and personality testing are included in the Selected References
for Chapters 6 through 9 of this textbook.

Two books that would be of interest to teachers doing diagnostic and
remedial work are the following:


Blair, Glenn M., Diagnostic and Remedial Teaching: A Guide to Practice in
Elementary and Secondary Schools, rev. ed. (New York: The Macmillan
Company, 1956).

Bond, Guy L., and Eva Bond Wagner, Teaching the Child to Read, 3d ed.
(New York: The Macmillan Company, 1960).

Information is given in the Appendix concerning reading readiness tests,
reading tests, and individual intelligence tests.




Books are available in several subject fields that focus on measurement
and evaluation: Arny, home economics; Hardaway and Maier, business
education; Micheels and Karnes, industrial arts; and several in physical
education, of which the most recent is by Clarke.18 Several recent yearbooks
of professional organizations of teachers contain valuable chapters
on measurement. The Educational Testing Service has reprinted such a
chapter on evaluation in mathematics, with an annotated bibliography
of published tests.19


SUMMARY STATEMENT 


In this chapter we have illustrated how the principles presented in Chapters 
2, 3, and 4 should be applied in the selection of tests for specific purposes; we 
have presented a test-evaluation form designed to assist the user in summarizing 
data relevant to test selection; and we have briefly reviewed sources of infor- 
mation that can be used to advantage in the selection process. 

The principles developed in Chapters 2, 3, and 4 cannot be routinely applied. 
Experience is needed in their application to the appraisal of specific tests for 
specific purposes and groups. To aid the student in obtaining experience in 
appraising tests in his major field of study, a comprehensive, classified list of 
published tests is provided in the Appendix; references to reviews in Buros' 
Mental Measurements Yearbooks are included for each test for which such 
reviews were available (as of date of publication of this textbook). 

In this chapter, assistance was also given in understanding the various types 
of tests available for use. The major characteristics which distinguish between 
standardized and teacher-made tests were considered. Tests were next classified 
according to the degree to which they involved indirect measurement or rele- 
vance to the criterion behaviors about which the test user wishes to make 
judgments. Several other bases for classification were clarified, namely: (1) 
group tests vs. individual tests, (2) pencil-and-paper tests vs. performance tests, 
and (3) speed vs. power tests. 

When tests are classified on the basis of their content, the major distinction 
usually made is between tests of ability (in which the goal is to measure the 
individual's maximum performance) and inventories of personality, interest, 
and attitudes (in which the goal is to measure his typical performance).
Although ability tests can be further classified into aptitude and achievement
tests, these tests actually lie along a continuum with respect to their emphasis
on verbal-educational factors.


18 Clara Brown Arny, Evaluation in Home Economics (New York: Appleton- 
Century-Crofts, 1953); Mathilde Hardaway and Thomas Maier, Tests and Measure- 
ments in Business Education, 2d ed. (Cincinnati: South-Western Publishing Com- 
pany, 1952); William J. Micheels and M. Ray Karnes, Measuring Educational 
Achievement (New York: McGraw-Hill Book Company, Inc., 1950); H. Harrison 
Clarke, Application of Measurement to Health and Physical Education, 3d ed. 
(Englewood Cliffs, N. J.: Prentice-Hall, Inc., 1959).

19 Sheldon S. Myers, "Evaluation in Mathematics," Twenty-sixth Yearbook, The 
National Council of Teachers of Mathematics (Washington, D. C.: The Council, 
1961). Reprint available from the Educational Testing Service. 




SELECTED REFERENCES 


BUROS, OSCAR K., ed., The 1940 Mental Measurements Yearbook. Highland Park,
N.J.: The Mental Measurements Yearbook, 1941.

BUROS, OSCAR K., ed., The Third Mental Measurements Yearbook. New Brunswick,
N.J.: Rutgers University Press, 1949.

BUROS, OSCAR K., ed., The Fourth Mental Measurements Yearbook. Highland Park,
N.J.: The Gryphon Press, 1953.

BUROS, OSCAR K., ed., The Fifth Mental Measurements Yearbook. Highland Park,
N.J.: The Gryphon Press, 1959.

BUROS, OSCAR K., ed., Tests in Print. Highland Park, N.J.: The Gryphon Press, 1961.

KATZ, MARTIN R., Selecting an Achievement Test: Principles and Procedures.
Evaluation and Advisory Service Series No. 3. Princeton, N.J.: Educational
Testing Service, 1958. Available on request.

SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness, rev. ed.
New York: Harper & Row, Publishers, Inc., 1962, Chapter 3.

WESMAN, ALEXANDER G., "Comparability vs. Equivalence of Test Scores," Test
Service Bulletin No. 53. New York: The Psychological Corporation, 1958.
Available on request.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. In what ways do standardized tests differ from teacher-made examinations?

2. With the aid of the summary form presented in this chapter, evaluate two
or more standardized achievement tests.

3. Select a test in your subject field from those reviewed in Buros' Mental
Measurements Yearbooks. Summarize and evaluate the Buros' reviews of the
test you select.

4. Why do so many standardized tests involve indirect measurement of
criterion behavior?

5. Examine two or three catalogs issued by test publishers. Do they restrict
sale of their tests to persons qualified to administer them and interpret the
results? Do they follow the APA code exactly?

6. What are the chief differences between ability tests and tests of typical
performance?


PART TWO
The Study of Individuals


6 The Measurement of Aptitudes 


Some people think of aptitudes as innate abilities. There is increasing
awareness, however, that we inherit structures with potentialities for
functional use, rather than abilities; and that the development of one's
potentialities depends upon environmental factors. We can measure only the
student's performance in his developed abilities, which represent the
cumulated results of interaction between innate structures and environmental
situations.

The problem of estimating the percentage of variance in general mental
ability, which is attributable to hereditary vs. environmental factors, has
appealed to many investigators.1 However, the current trend is to devote
greater attention to studies on the development of concepts and problem-solving
abilities and to factors that seem to encourage independence,
flexibility, and resourcefulness in children's problem-solving behavior.2


THE CONCEPTS OF APTITUDE AND ACHIEVEMENT 


Aptitude is defined in the Dictionary of Psychology as "a condition or set
of characteristics regarded as symptomatic of an individual's ability to
acquire with training some (usually specified) knowledge, skill, or set
of responses such as ability to speak a language, to produce music, and
the like."3


1 Cyril Burt, "The Inheritance of Mental Ability," American Psychologist, vol.
13 (January 1958), pp. 1-15; Intelligence: Its Nature and Nurture, 39th Yearbook
(Chicago: National Society for the Study of Education, 1940).

2 John McV. Hunt, Intelligence and Experience (New York: Harper & Row,
Publishers, Inc., 1961).

3 H. C. Warren, Dictionary of Psychology (Boston: Houghton Mifflin Company,
1934), p. 18.




When we use tests as measures of aptitude, we are concerned with how 
well test scores predict the examinee's future performance in some activity. 
When we use tests as measures of achievement, we are concerned with 
the examinee's present level of performance, as a basis for judging his 
success in past learning activities.

The way in which achievement and aptitude tests overlap in content 
and the extent to which tests vary in their emphasis on specific learnings 
were illustrated in Figure 5.1. Aptitude tests tend to be limited to tasks that
are (1) equally unfamiliar to all examinees (that is, novel situations) or
(2) equally familiar (in the sense that all students have had equal opportunity
to learn, regardless of their pattern of specific courses).
We tend to use the term "aptitude" in two different ways: 


1. Sometimes we speak of a person's having considerable aptitude for a subject 
(such as reading, science, or mathematics) or for a vocation (such as law 
or teaching). In this sense of the word, "aptitude" connotes a combination 
of traits and abilities that result in the person's being well qualified for train- 
ing in a subject, activity, or occupation. 

2. At other times we use the term "aptitude" in a narrower, more scientific
sense to mean a discrete, unitary ability, such as numerical ability or spatial
aptitude, which has significance (in varying degrees) for a number of subjects,
activities, and occupations.


When we develop a test to predict success in some activity, and we 
wish to use the test only for that purpose, we use the term “aptitude” in 
its first meaning, as a combination of abilities that characterize successful 
performers in that activity. We might devise a test composed of many 
items, each of which correlates with success in that activity. We need to 
make no attempt to organize those items into homogeneous groups, designed 
to measure distinct abilities. A test devised to measure aptitude, in the 
first sense of the word, would logically be a heterogeneous, omnibus type 
of test.4 Two individuals might earn equal scores on such an omnibus
test on the basis of superiority in quite different abilities. 

In predicting student success in school work during the elementary 
school years, we are usually satisfied with a scholastic aptitude test of the 
omnibus type. We need a test that provides a composite score on a com- 
bination of abilities related to success in the academic aspects of the school 
program. We do not need a profile of abilities since the abilities of young 
children have not yet become highly differentiated in terms of differing 
interests and the selection of differentiated courses and work experiences. 

In high school counseling, however, we are usually attempting to pre- 
dict student success in many different fields. Moreover, the subject or 
vocation in which we wish to predict success varies from student to student. 
Hence, for purposes of counseling high school students, we would prefer 


4The term “omnibus test” is defined and illustrated in the Glossary. 




to measure aptitudes in the second sense of the word, as unitary abilities. 
We would prefer an aptitude test with subtests measuring discrete, factorially
pure abilities. Then, these scores could be combined in various ways depending
on their significance for different vocations or training programs.

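
The idea of combining the same subtest scores in different ways for different vocations can be sketched as a set of weighted composites. Everything specific in this sketch is hypothetical: the aptitude names, the standard scores, and the weights are invented for illustration and are not taken from any published battery.

```python
def composite(scores, weights):
    """Weighted sum of the aptitude scores named in `weights`."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical standard scores for one student on three unitary aptitudes.
aptitude_scores = {"verbal": 55.0, "numerical": 62.0, "spatial": 48.0}

# Hypothetical weights expressing each aptitude's significance for a field.
field_weights = {
    "clerical training": {"verbal": 0.5, "numerical": 0.4, "spatial": 0.1},
    "mechanical training": {"verbal": 0.1, "numerical": 0.3, "spatial": 0.6},
}

predictions = {field: composite(aptitude_scores, weights)
               for field, weights in field_weights.items()}

for field, value in predictions.items():
    print(f"{field}: {value:.1f}")
```

The same three scores yield a different composite for each field, which is why a profile of discrete abilities is more useful in counseling than a single omnibus score. In practice the weights would be estimated from validity studies rather than chosen by hand.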
As we review briefly the history of aptitude testing, the reader will note
that "aptitude" has been used in both senses of the word. Binet, in his
attempts to identify slow learners, developed a test that was intended to 
measure scholastic aptitude, or aptitude for school learning. No attempt 
was made to identify or measure discrete, unitary components of mental 
ability. The early group tests of mental ability, developed during and after 
World War I, were also omnibus tests, with the items having been selected 
in terms of their relationship to general level of scholarship, or success 
in training programs. 

In World War II, however, the armed services were concerned not only
with level of general mental ability but with aptitudes for many specific
jobs. A limited pool of manpower had to be assigned to jobs in which each
person was likely to do best. The vastness of the classification problem
led to a realization that a custom-made test of aptitude for each type of
job was not feasible. If aptitudes could be identified, in the scientific sense
of the word, as discrete, unitary abilities, tests could be devised that were
homogeneous in content and fairly independent of each other. Then, the
prediction job could be reduced to manageable proportions. Recruits could
be measured on a limited number of aptitudes; their scores on these tests
could be combined and weighted in different ways so as to predict success
in many different jobs. Hence, World War II saw the development of
multiscore tests of mental ability, or aptitude test batteries. Such batteries
were based on research in factor analysis and designed to measure aptitudes
in the second or scientific sense of the word.

In this chapter, we will discuss first the tests of general mental ability.
Then we will turn our attention to multiscore tests of different components
of mental ability. In the multiscore tests, an attempt is made to test examinee
performance in several fairly discrete aptitudes. Following the discussion
of these two major approaches to aptitude testing, briefer consideration will
be given to more specialized aptitude tests.


TESTS OF GENERAL MENTAL ABILITY OR 
SCHOLASTIC APTITUDE 
Pioneer Work in the Testing of General Mental Ability 


In 1904 the minister of public instruction in France became concerned
about the high percentage of failure in the schools of Paris. He appointed




Alfred Binet to a commission to identify those pupils who were so mentally 
deficient as to require instruction in special classes. Hence the first attempts 
to measure intelligence were designed to measure aptitude for school work. 

In collaboration with Theodore Simon, Binet developed in 1905 an 
individual intelligence test that was to be the prototype of many leading 
intelligence tests still in use and that earned him the title of “father of 
intelligence testing." The 1908 revision was an improvement over the 
1905 scale. Binet assigned each of the 59 tests of this revision to an age 
level (from age 3 to age 13). He introduced for the first time the concept 
of "mental age." Further experimentation resulted in the 1911 revision, 
published in the year of Binet's death. 

The most widely accepted revision of the Binet scale was the Stanford
Revision, published by Lewis M. Terman in 1916. It was accompanied 
by an extensive manual, providing a standardized technique for adminis- 
tration and scoring. For more than 20 years the Stanford-Binet was the 
standard measure of intelligence, the criterion with which all other intel- 
ligence tests, group and individual, were compared. 

In 1912 William Stern proposed the idea of computing the ratio of 
mental age to chronological age as a measure of rate of mental growth. 
He called this ratio a mental quotient. Terman adopted this concept, which 
has since gained universal acceptance, and applied the term “intelligence 
quotient,” or “IQ,” in the Stanford revision of the Binet. 
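
Stern's ratio of mental age to chronological age can be written out directly. The sketch below is illustrative only: the function names and the ages are hypothetical, and the multiplication by 100 is the customary convention for the ratio IQ rather than something stated in the passage above.

```python
def mental_quotient(mental_age_months, chronological_age_months):
    """Stern's ratio of mental age to chronological age."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age_months / chronological_age_months


def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: the mental quotient multiplied by 100 (customary convention)."""
    return 100.0 * mental_quotient(mental_age_months, chronological_age_months)


# Hypothetical child: mental age 10 years, chronological age 8 years.
mq = mental_quotient(10 * 12, 8 * 12)
iq = ratio_iq(10 * 12, 8 * 12)
print(f"mental quotient = {mq:.2f}, ratio IQ = {iq:.0f}")
```

A child whose mental age equals his chronological age has a quotient of 1, or a ratio IQ of 100; the hypothetical child above is developing at one and one-quarter times the average rate.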

In an attempt to measure the intelligence of deaf children, the Pintner- 
Paterson Performance Scale was developed in 1917. The scale included a 
series of 15 picture puzzles, form boards, and other tests of a nonverbal 
character. Among the performance tests that have since been developed, 
the Arthur Point Scale of Performance Tests is probably the most widely 
used. Performance tests not only are indispensable in the testing of deaf
children but also serve as useful supplementary material in the testing
of young children, as well as bilingual subjects and mental defectives
of all ages.


Development of Group Tests of General Mental Ability 


In 1917, the United States government faced the problem of training a 
large army as quickly as possible and of selecting from this large group 
the men who should be trained as officers. The American Psychological 
Association offered its services to the government, and a committee of 
psychologists prepared the first group test of intelligence, the Army Alpha, 
which was administered to nearly two million men. 


5 Lewis M. Terman, The Measurement of Intelligence (Boston: Houghton Mifflin
Company, 1916).




The testing of illiterate and non-English-speaking soldiers, however, was 
still a problem, for performance tests had to be administered individually. 
The development of the Army Beta marked another milestone in testing 
in that it was a group test that involved performance-type or nonverbal 
items. The directions for the Army Beta could be given by means of
pantomime. 

The use of the verbal Army Alpha and the nonverbal Army Beta dem- 
onstrated (1) the value of mental tests for revealing individual differences 
in mental ability among people of normal intelligence; (2) the fact that 
mental testing need not be a costly, individual procedure; and (3) the 
value of the tests in the practical classification of men. Within the short 
span of two or three years, group intelligence tests were accepted to 
an extent that would probably not have been attained in a decade or 
more of civilian use. 


Later Developments in Individual Tests of General Mental Ability 


During the 20-year period following its publication, the Stanford Revision
of the Binet test was widely used in schools and clinics, as well
as in educational research. In 1937, a revision of the Stanford-Binet Scale
was published,6 with two equivalent forms, L and M. The 1937 tests were
made less verbal at the lower levels, and the earlier emphasis on rote
memory was corrected.

[...] measurement of adult intelligence, Wechsler
published in 1939 the Wechsler-Bellevue Scale for Adolescents and Adults.7
The scale included six verbal tests and five performance tests. An attempt
was made to include the types of tasks that would interest adults. Norms
were provided for persons from ages 10 to 60, the scores for each person
[...] those of others in his age group. Thus
Wechsler redefined the intelligence quotient in terms of the relative rank of
an individual in a group of persons of approximately his own age.

An outstanding innovation was the fact that three intelligence quotients
could be obtained from the test: one from the verbal subtests, one from
the performance subtests, and a third from the total scale. Wechsler also
emphasized the diagnostic value of the pattern of subtest scores. He
developed equated scores for his subtests so that a profile of
abilities and disabilities could be drawn for each examinee. Later research
studies revealed, however, that differences between subtest scores had low


6 Lewis M. Terman and Maude A. Merrill, Measuring Intelligence (Boston:
Houghton Mifflin Company, 1937).

7 David Wechsler, The Measurement of Adult Intelligence (Baltimore: The
Williams and Wilkins Company, 1939).




reliability; and pattern analysis proved to have low validity. Hence, this 
test cannot be considered a multiscore test of mental abilities. 
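
Wechsler's redefinition of the intelligence quotient as relative rank within an age group can be sketched as a standard score computed against that group's norms. This is a simplified illustration, not Wechsler's actual procedure (which relies on published norm tables): the norm samples are hypothetical, and the mean of 100 and SD of 15 are the values reported for the Wechsler scales in Table 6.1.

```python
from statistics import mean, pstdev


def deviation_iq(raw_score, age_group_raw_scores, iq_mean=100.0, iq_sd=15.0):
    """Standard score of `raw_score` relative to one age group's norms."""
    group_mean = mean(age_group_raw_scores)
    group_sd = pstdev(age_group_raw_scores)
    if group_sd == 0:
        raise ValueError("norm group scores must vary")
    z = (raw_score - group_mean) / group_sd
    return iq_mean + iq_sd * z


# Hypothetical norm samples: the older group earns lower raw scores on average.
norms_age_20 = [40, 50, 60]
norms_age_60 = [30, 40, 50]

raw = 50  # the same hypothetical raw score evaluated against each group
iq_at_20 = deviation_iq(raw, norms_age_20)
iq_at_60 = deviation_iq(raw, norms_age_60)
print(f"IQ at age 20: {iq_at_20:.0f}; IQ at age 60: {iq_at_60:.0f}")
```

Under these assumed norms the same raw score is exactly average for the younger group but well above average for the older group, so it converts to a higher IQ at the older age.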

The Wechsler-Bellevue has now been superseded by two batteries, the 
Wechsler Adult Intelligence Scale (WAIS) for ages 16 and above, and 
the Wechsler Intelligence Scale for Children (WISC) for ages 5-15. The 
WISC was published in 1949, while the revised adult scale (WAIS) was 
published in 1955. In Table 6.1 the characteristics of the Wechsler tests 
are summarized, in comparison with those for the Stanford-Binet. 


Table 6.1 
Comparison of the Stanford-Binet Scales and 
the Wechsler Intelligence Tests 


Stanford-Binet* 


Wechsler? 


TEST CONTENT AND ORGANIZATION 


Terman selected a wide variety of test 
items that measured general mental 
ability but made no attempt to in- 
clude a sufficient number of items of 
any one type to justify computation 
of subscores. 


Tests highly weighted with verbal 
abilities. 


Items arranged in a spiral omnibus 
form, according to increasing degree 
of difficulty. Items organized by age 
levels. 


Specific tests vary from one age level 
to another. Some types of items (for 
example, vocabulary and memory 
for digits) are used over a wide 
range of age levels; others appear 
at only one or two age levels. 


Items of wide range of difficulty in- 
cluded. Children with a mental age 
as low as two or three can be ade- 
quately tested. 


Illustrative items given in Table 6.2 
Items were selected that appeared 
to measure mental ability and on 
which success was highly correlated 
with age." Items were selected for 


Wechsler selected items that repre- 
sented different types of intellectual 
performance. His tests include a 
broader range of tasks than does 
the Stanford-Binet. 


Five performance tests included in 
both WAIS and WISC. 


Items organized by types into subtests, 
grouped into a verbal scale and a 
performance scale. 


The same types of tasks are used at 


all age levels within WAIS and 
WISC. 


Insufficient content is included of right 
difficulty level for testing children 
with mental age below 7. 


The verbal scale of WAIS includes: 
general information, general com- 
prehension, arithmetical reasoning, 
similarities, digit span, and vocabu- 
lary. 


The Measurement of Aptitudes 187 


Stanford-Binet* 


the 1960 edition on the basis of high 
correlation with success on total test. 


VALIDITY 


Since the Stanford-Binet was the only 


widely used individual intelligence 
test for many decades, it has been 
the criterion for determining the 
concurrent validity for many group 
tests. 


The predictive validity of the test in 
predicting academic achievement in 
school has been well established, 
with rs approximating .70 at the 
elementary school level and 60 at 
the high school level. 


Factor analysis studies indicate that 
the test measures somewhat different 
abilities at different age levels, for 
example, reasoning factors that ap- 
peared at three other age levels did 
not appear in the age 11 tests.‘ 


May yield invalid scores for bilingual 
children and others whose ехрегі- 
ence with the English language is 
limited; performance tests should be 
used with these children to obtain 
supplementary data. 


RELIABILITY 


Equivalent-form reliability coefficients 
computed separately for each age 
level. Median reliability coefficient 
for ages 2-6, .88; for older children, 
-93. Especially reliable for examinees 
of low mental ability. Since Form 
L-M includes only items showing 


a $$$ 


Wechsler? 


In WISC, the digit span test is optional. 

The performance scale of WAIS in- 
cludes: digit-symbol substitution, pic- 
ture completion, block design, picture 
arrangement, and object assembly. 

In WISC, the coding of a simple mes- 
sage is substituted for the digit- 
symbol test; a maze test is added as 
an optional test. 


WISC full-scale IQ correlates highly 
with Stanford-Binet, usually showing 
rs in the .805. The verbal scale 
correlates more highly with the Stan- 
ford-Binet than does the perform- 
ance scale." 


The predictive validity of the Wechs- 
ler tests for academic criteria is 
somewhat reduced by the inclusion 
of the performance tests that are 
slightly less reliable and are less 
closely related to academic success. 


Factor analysis indicates that the 
Wechsler tests are factorially com- 
plex and that subtests do not meas- 
ure pure factors. 


Performance scale of special value for 
examinees with a language handicap; 
observance of subjects at work on 
performance tasks may provide clues 
to a qualified psychologist concern- 
ing emotional disturbance or brain 
damage. Routine, objective analysis 
of patterns of subtest scores is of 
questionable validity. 


Split-halves reliability coefficients ob- 
tained—.92 to .95 for full-scale, .88 
to .96 for verbal scale, .86 to .90 
for performance scale. Performance 
test probably most reliable perform- 
ance test now available. Differences 
of less than 15 points between verbal 


188 


THE STUDY OF INDIVIDUALS 


Table 6.1 (Continued) 
Comparison of the Stanford-Binet Scales and 
the Wechsler Intelligence Tests 


Stanford-Binet? 


high relationships to total score, 
reliability coefficients for 1960 edi- 
tion should be higher. Reliability 
coefficients lower for younger chil- 
dren. 


NORMS 


Norms for 1937 edition based on test- 
ing a sample of 3184 subjects, ages 
1% through 18, carefully selected 
to be representative of white, native- 
born population. Seventeen com- 
munities in 11 states were included, 
and an effort was made to have the 
sample representative with respect 
to socioeconomic level. 


Norms for 1960 edition are based on 
the 1937 standardization, adjusted 
in terms of data collected during 
the 1950s. 


Child’s MA computed by adding to 
his basal age the number of months 
earned by tasks successfully рег- 
formed at higher levels.” 


1Q’s on 1960 edition are standard 
scores with a mean of 100 and an 
SD of 16. 


A given raw score (MA) yields the 
same IQ at all adult ages. 




Wechsler? 


and performance IQ's should not be 
taken seriously. Reliability of differ- 
ences between subtests too low to 
justify analysis by any objective 
method. Only differences larger than 
3 scaled-score units should be taken 
seriously. 


Norming samples for WAIS represent- 
ative of general population (with 
sampling at each age level stratified 
with respect to region, urban-rural 
residence, occupation, and educa- 
tion). 


Norming samples for WISC drawn 
from 85 communities in 11 states. 


MA not used.


Verbal, performance, and full-scale 
IQ's are standard scores with a 
mean of 100 and an SD of 15. Subtest
scores are converted into scaled
scores with a mean of 10 and an 
SD of 3. 


A given raw score yields different IQ's 
at different ages, depending on its 
standard-score equivalent in the
norms for that age group; for ex- 
ample, a specific raw score would 
yield a higher IQ at age 60 than at 
20, for it would rank relatively higher 
with respect to the older group. 




Stanford-Binet* Wechsler^ 


For children, adolescents, and young 
adults, Stanford-Binet IQ's average 
approximately 7 points higher than 
Wechsler IQ's. 


USABILITY


Stanford-Binet


Content interesting to most examinees. Special training and supervised
experience in administering and scoring tests is essential. Manual provides
adequate basis for attaining objectivity in scoring.


Diagnostic scores on the examinee's performance on different types of
items cannot be obtained; however, observations of child's reaction to
standard test situations are of help in diagnosis.


Usual procedure of having examinee continue until he fails all tasks at
a grade level may be seriously upsetting to some children.


Wechsler


Content interesting to most examinees. Special training and supervised
experience in administering and scoring tests is essential; directions
somewhat less complex than in Stanford-Binet. Manual provides adequate
basis for attaining objectivity in scoring.


Verbal and performance IQ's can be obtained. Scores on subtests can
provide diagnostic clues, to be verified or rejected on the basis of other
data.


Performance test scores of examinees affected little by language handicap.
Performance tests provide good opportunity to observe behavior of
subjects who are emotionally disturbed by their difficulties in problem
solving.


a Tests included for each of 20 levels of ability, ranging from tasks suitable for the average
child of age 2, through four levels developed for differentiating among average and superior
adults.

b WISC (Wechsler Intelligence Scale for Children) for ages 5 through 15 and WAIS (Wechsler
Adult Intelligence Scales) for ages 16 and above. Wechsler's selection of types of items was
based on his observations that some mental functions are more disturbed than others in mental
patients, and that some functions show a greater deterioration with age than others.

c Items were selected that had a "steep age gradient." For example, a test item was
considered a good one for seven-year-olds if few six-year-olds, most eight-year-olds, and about half
of the seven-year-olds got it right.

d Judith L. Krugman and others, "Pupil Functioning on the Stanford-Binet and the Wechsler
Intelligence Scale for Children," Journal of Consulting Psychology, vol. 15 (December 1951),
pp. 475-483; W. H. Guertin and others, "Research with the Wechsler-Bellevue Intelligence Scales,
1950-1955," Psychological Bulletin, vol. 53 (May 1956), pp. 235-257.

e P. C. Davis, "A Factor Analysis of the Wechsler-Bellevue Intelligence Scale, Form I, in a
Matrix with Reference Variables," American Psychologist, vol. 7 (July 1952), pp. 296-297; B.
Balinsky, "An Analysis of the Mental Factors of Various Age Groups from Nine to Sixty," Genetic
Psychology Monographs, vol. 23 (February 1941), pp. 191-234.

f L. V. Jones, "A Factor Analysis of the Stanford-Binet at Four Age Levels," Psychometrika,
vol. 14 (December 1949), pp. 299-331.

g Quinn McNemar, The Revision of the Stanford-Binet Scale: An Analysis of the Standardization
Data (Boston: Houghton Mifflin Company, 1942).

h The basal age for a child is the highest age level at which he got all items correct. For
example, if a child gets all the items correct at ages 5 and 6, his basal age is 6. If he gets
two-thirds of the items correct at age 7, one-third of the items at age 8, and none at age 9,
his MA would be 6 + 8 mo. + 4 mo. = 7 years.
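Two of the scoring rules summarized in Table 6.1 lend themselves to a short worked sketch: the mental-age computation described in the note on basal age, and the conversion of a standard score to an IQ with a mean of 100 and an SD of 16 (Stanford-Binet, 1960 edition) or 15 (Wechsler). The function names below are illustrative only, and the sketch assumes, as the note's example implies, that each year level above the basal age is worth 12 months of credit shared equally among its items. It is not the publishers' scoring procedure.

```python
from fractions import Fraction


def mental_age_months(basal_age_years, fractions_passed):
    """Basal age plus months of credit earned at higher age levels.

    fractions_passed gives, for each year level above the basal age,
    the fraction of that level's items passed. Assumption: every level
    is worth 12 months of credit.
    """
    months = Fraction(basal_age_years * 12)
    for fraction in fractions_passed:
        months += Fraction(fraction) * 12
    return months


def standard_score(z, mean=100, sd=15):
    """Convert a z-score to a standard score with the given mean and SD."""
    return mean + sd * z


# The example in the note: basal age 6, two-thirds of the items passed
# at age 7, one-third at age 8, none at age 9.
ma = mental_age_months(6, [Fraction(2, 3), Fraction(1, 3), 0])
print("MA in months:", ma, "| MA in years:", ma / 12)

# The same relative standing (two SDs above the mean) on each scale.
print("Stanford-Binet IQ (SD 16):", standard_score(2, sd=16))
print("Wechsler IQ (SD 15):", standard_score(2, sd=15))
print("Wechsler subtest scaled score:", standard_score(2, mean=10, sd=3))
```

Because the two tests use different SDs, the same relative standing corresponds to slightly different IQ values on each, which is one reason scores from the two scales are not directly interchangeable.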


In 1960, a new edition of the Stanford-Binet was issued. The items of 
forms L and M that had shown the highest validity (in terms of corre- 
lation with achievement on the total test) were retained in a single form 
known as Form L-M. In the 1960 edition, a set of tests is provided for 
each of 20 ability levels, beginning with tests appropriate for the average 
two-year-old and extending through four levels designed to differentiate 
among average and superior adults. Illustrative items for each of five
levels are given in Table 6.2. Norms were revised in terms of data
accumulated during the 1950's.


Table 6.2 
Illustrative Items at Different Age Levels of the Revised 
Stanford-Binet Scales 




TWO-YEAR LEVEL 


Identifying six parts of the body on a large paper doll 
"Show me the dolly's hair." Credit for age 2 is given if child correctly identifies 
three; credit for age 2½ if he correctly identifies all six.

Picture vocabulary 


The child is asked, "What is this? What do you call it?" as he is shown 18
cards containing pictures of common objects (credit for age 2 is given for
identifying two objects; credit for age 2½ if eight are correctly identified).


SIX-YEAR LEVEL 


Vocabulary 
The same graded list of 45 words is used for age 6 and above. Credit at the 
6-year level is given for six correct definitions of such words as "tap," "orange," 
“envelope,” and the like. The examiner says, "When I say a word, you tell me 
what it means. What is an orange?" 




Mutilated pictures 
The child is shown five pictures, each showing an object that has a missing 
part, such as a wagon with three wheels. He is asked, "What is gone in this 
picture?" or “What part is gone?" Four out of five must be correct for credit. 


Number concepts 
Twelve one-inch blocks are put in front of the child. He is asked to give the 
examiner different numbers of blocks, for example, "Give me three blocks. Put 
them here." Four correct answers out of five give credit at the 6-year level. 


Maze tracing 
Three mazes (with starting and finish points marked) are presented in succession.
In each maze one route is longer than the other. The child is asked to
trace the shortest route. Two correct out of three give credit.


TEN-YEAR LEVEL 


Vocabulary 
Credit is given at the 10-year level if 11 or more words are correctly defined. 


Word naming
The child is asked to name as many words as he can in two minutes. Credit at
the 10-year level is given for 28 words or more.


Repeating digits
Six digits are read at one-second intervals. The child is asked to repeat them in
exactly the same order. Three series are presented. Credit is given if the child
recalls at least one of these series correctly.


TWELVE-YEAR LEVEL 


Vocabulary 
Credit is given at the 12-year level if 15 or more words are defined correctly. 

Verbal absurdities
Five statements containing absurdities are presented. In each case the examiner
says, “What is foolish about that?” For credit, four of the five absurdities must
be correctly identified.


Repeating digits reversed
The examiner says, “I am going to say some numbers, and I want you to say
them backwards.” Three series of 5 digits each are presented. The child is given
credit if at least one of these series is correctly repeated.


Abstract words
The examiner asks, "What do we mean by courage?" Credit is given at this
level for three correct responses out of four.



AVERAGE-ADULT LEVEL


Vocabulary
Credit is given at the average-adult level if 20 or more words are correctly
defined.


Differences between abstract words
The subject is asked to distinguish between pairs of associated words, for example,
"poverty" and "misery." Credit is given if the subject correctly distinguishes
between at least two of the three pairs presented.


Proverbs
The subject is asked to explain the meaning of three proverbs. Credit is given
for at least two correct interpretations.


Ingenuity
Three novel problems are presented, for example, how one would obtain exactly
3 pints of water from a river with only a 7-pint container and a 4-pint
container. Credit is given for the correct solution of two out of three problems.
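The ingenuity item above is a small search problem, and its solution can be checked mechanically. The sketch below is an illustration only, not part of the test: it assumes the allowed actions are filling a container from the river, emptying it, and pouring one container into the other, and it uses a breadth-first search so that the first solution found is also a shortest one. The function name `measure` is hypothetical.

```python
from collections import deque


def measure(capacities, target):
    """Return a shortest list of actions leaving `target` pints in a container.

    Returns None if the target cannot be reached with the given capacities.
    """
    start = tuple(0 for _ in capacities)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if target in state:
            return steps
        moves = []
        for i, cap in enumerate(capacities):
            # Fill container i from the river.
            filled = list(state)
            filled[i] = cap
            moves.append((tuple(filled), f"fill the {cap}-pint container"))
            # Empty container i.
            emptied = list(state)
            emptied[i] = 0
            moves.append((tuple(emptied), f"empty the {cap}-pint container"))
            # Pour container i into each other container j.
            for j, cap_j in enumerate(capacities):
                if i != j:
                    amount = min(state[i], cap_j - state[j])
                    poured = list(state)
                    poured[i] -= amount
                    poured[j] += amount
                    label = (f"pour the {cap}-pint container "
                             f"into the {cap_j}-pint container")
                    moves.append((tuple(poured), label))
        for next_state, label in moves:
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, steps + [label]))
    return None


for step in measure((7, 4), 3):
    print(step)
```

For the 7-pint and 4-pint containers the search needs only two actions: fill the larger container, then pour from it until the smaller one is full, which leaves 3 pints behind.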


Later Developments in Group Tests of General Mental Ability 


Many of the early group intelligence tests contained such a large per- 
centage of verbal items that children with language handicaps or reading 
disabilities made spuriously low scores. Hence, educators welcomed group 
intelligence tests that provided language and nonlanguage intelligence 
quotients for children. One of the first group intelligence tests to incorpo- 
rate this feature was the California Test of Mental Maturity. Samples of 
language and nonlanguage items from this test are given in Figure 6.1. 
The Pintner, the Lorge-Thorndike series, and others now provide separate 
IQ's on verbal and nonverbal tests.

Poor readers tend to obtain successively lower intelligence quotients on 
typical tests of general mental ability given in the upper elementary and 
high school grades as these tests include more and more items requiring 
reading ability. The use of a nonverbal intelligence test is helpful in 
determining whether the lower intelligence quotients obtained in the higher 
grades can be attributed chiefly to reading disability. 

Actually, many so-called nonlanguage tests involve the use of consider- 
able language in the giving of directions and in the child's use of verbal 
concepts in working with problems. However, reading ability is not re- 
quired in these tests; hence they are considered to yield more valid estimates 
of learning ability for children who are poor readers. It is a mistake to 
assume, however, that the nonlanguage or nonverbal IQ represents the
pupil's "true ability" or the level of performance he would achieve on 


LANGUAGE ITEMS


147. How many 1½-cent stamps would you give
in even exchange for 30 one-half-cent stamps?

     10     15     20     45


215. A weighs less than B.
B weighs less than C.
Therefore
1. B weighs more than C.
2. A's weight equals B's and C's.
3. A weighs less than C.


220. W is between X and Y.
X is between Y and Z.
Therefore
1. W is not between Y and Z.
2. W is between X and Z.
3. W is nearer to X than to Z.


NONLANGUAGE ITEMS


In each row find the drawing that is a different view of the first drawing.
[Drawings not reproduced.]


The first three pictures in each row are alike in some way. Decide how
they are alike, and then find the one picture among the four to the right
of the dotted line that is most like them.
[Drawings not reproduced.]


Fig. 6.1 Sample Items from the California Test of Mental Maturity


Reprinted with the permission of the California Test Bureau from Elizabeth T.
Sullivan, Willis W. Clark, and Ernest W. Tiegs, California Test of Mental Maturity,
Junior High, 1957 Edition. Monterey, Calif.: California Test Bureau, 1957.


verbal tests if his reading handicap were removed. Actually, nonlanguage 
tests measure different aspects of mental ability than the more verbal tests. 
They tend to have lower reliability and lower predictive validity for success 
in almost all school subjects, as well as in most of the vocations for which 
data have been studied. Low predictive validity especially characterizes 
nonlanguage tests that are limited to tasks involving perceptual skills and 
spatial aptitudes, rather than tasks that require reasoning with numbers 
or other symbols. 


Individual vs. Group Tests of General Mental Ability 


Children who have sensory handicaps,⁸ those who appear to be mentally
retarded, and others who do not seem to perform adequately in a group
testing situation should be given individual tests by trained examiners.
In an individual test, the informal oral approach permits the
examiner to establish rapport with the student, to stimulate maximal effort, 
and to observe and to evaluate his behavior. 

The individual intelligence test offers several advantages: (1) greater 
variety of content, (2) opportunity to motivate the individual by means 
of encouragement and praise whenever necessary, (3) opportunity to 
adapt the tempo of testing to the personality of the individual, (4) absence 
of competition with others, and (5) greater opportunity to observe the 
individual's behavior under controlled conditions. 

The disadvantages of the individual test lie in (1) the difficulty of 
administration, a trained examiner being required; and (2) the amount 
of time required for each administration, approximately one hour per sub- 
ject. In city systems employing school psychologists, the use of individual 
intelligence tests is ordinarily limited to identifying the mentally handicapped,
testing children with sensory handicaps, verifying the results of
group intelligence tests for extreme deviates, and making diagnostic studies
of children with special problems.


MULTISCORE TESTS OF MENTAL ABILITIES AND APTITUDE 
TEST BATTERIES 


As a result of factor analysis studies by Thurstone and by many other 
factor analysts, considerable agreement has been reached concerning several 
components of mental ability. Table 6.3 lists 12 ability factors that have 
been confirmed by two or more factor-analysis studies. 


8 For deaf children, the Amoss-Ontario School Ability Examination or the
Nebraska Test of Learning Aptitude can be used. For those who are visually
handicapped, the Hayes Adaptations of the Binet and Wechsler-Bellevue Scales are
available. The Arnold Adaptation of the Leiter International Performance Scale
and the Columbia Mental Maturity Scale can be used with the multiply-handicapped
child.


Table 6.3 
Ability Factors Confirmed by Two or More Factor Analysis Studies 


VERBAL FACTORS


V₁—Verbal comprehension—ability to understand words and written materials,
such as in vocabulary tests (synonyms or antonyms, defining words), detecting
absurdities in stories, sentence completion, reading comprehension.

V₂—Word fluency—ability to produce words rapidly (as in rhyming, supplying
synonyms for easy words, listing words in a category such as foods, or listing
as many four-letter words as possible beginning with C).


REASONING FACTORS


N—Numerical computation—speed of solving simple arithmetic computations.

R₁—General reasoning—ability to invent solutions to problems, as in arithmetic
reasoning problems.

R₂—Deduction—ability to draw conclusions, as in logical syllogisms.

R₃—Eduction of relationships—ability to see the relationship between two things
or ideas and use this relationship to select other things or ideas, such as in
verbal analogies.


MEMORY FACTORS


M₁—Rote memory—ability to remember simple associations in which meaning is
of little importance, for example, ability to study pairs of names and numbers,
words and colors, and the like for a minute or two, and then show
immediate recall of the pairs when only the names or words are given on
the next page.

M₂—Meaningful memory—ability to memorize meaningful material, such as
sentences, lines of poetry, pairs of words that are meaningfully related, and
the like.


SPATIAL FACTORS


S₁—Spatial orientation—ability to detect quickly and accurately the spatial
arrangement of objects with respect to one's own body, for example, detecting
what maneuver an airplane is going through by examining a picture of the
landscape from that vantage point. Spatial orientation seems to require an
actual or imagined adjustment of one's own body.

S₂—Spatial visualization—ability to imagine how an object would look if its
spatial position were changed, for example, the examinee is shown a folded
paper with several holes in it and is asked to choose one of four or five
alternatives that shows how the unfolded paper would look.


PERCEPTUAL FACTORS


P₁—Perceptual speed—ability to recognize perceptual details rapidly, especially
similarities and differences between visual patterns, for example, checking
pairs of letter groups or number groups that are identical (making no mark
when they are different); choosing from several alternatives a geometrical
form like the one first presented.

P₂—Perceptual closure—the perception of objects from limited cues, that is, the
mental "putting together" of a perceptual form, such as a word when only
part of it is presented (as if partially erased or blurred).


Source: Condensed from a review of such studies in Jum C. Nunnally, Jr., Tests and Measurements:
Assessment and Prediction (New York: McGraw-Hill Book Company, 1959), pp. 174-180.


[Bar chart, not reproduced: percentile ranks (vertical axis) on the eight tests
of the battery: Verbal Reasoning, Numerical Ability, Abstract Reasoning, Space
Relations, Mechanical Reasoning, Clerical Speed and Accuracy, Language Usage:
Spelling, and Language Usage: Sentences.]

Fig. 6.2 Profile of Mary Dale's Percentile Ranks on the Differential Apti- 
tude Tests 


Adapted with the permission of the publisher from George K. Bennett 
and others, Counseling from Profiles: A Casebook for the Differential Aptitude 
Tests (New York: The Psychological Corporation, 1951), pp. 22-23. 


NOTE: If the vertical difference between the heights of any two bars on
the profile (before reduction) is one inch or more, the difference probably
represents a real difference in the abilities measured (that is, the
difference is significant at the 5-percent level). If the difference is
between one-half and one inch, the student is instructed "to consider
whether other things you know about yourself agree with it"; if the
difference is less than one-half inch, the student is asked to disregard it
as "probably not meaningful."
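The rule in the note can be written out as a simple three-way classification of the distance between two bars. The sketch below is only an illustration of that rule: the function names are hypothetical, the bar heights are invented values in inches (not Mary Dale's actual profile), and a gap of exactly one-half inch is assumed to fall in the middle category.

```python
from itertools import combinations


def interpret_gap(inches):
    """Classify the vertical distance between two profile bars (before reduction)."""
    if inches >= 1.0:
        return "probably a real difference"
    if inches >= 0.5:
        return "check against other information"
    return "probably not meaningful"


def compare_profile(bar_heights):
    """Apply the rule to every pair of tests in a profile.

    bar_heights maps a test name to its bar height in inches.
    """
    return {
        (first, second): interpret_gap(abs(bar_heights[first] - bar_heights[second]))
        for first, second in combinations(bar_heights, 2)
    }


# Hypothetical bar heights for three of the tests.
example = {"VR": 3.0, "NA": 2.4, "CSA": 1.75}
for pair, verdict in compare_profile(example).items():
    print(pair, "->", verdict)
```

Comparing every pair in this way makes it plain how few of the apparent peaks and valleys in a profile are large enough to interpret.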




Mary Dale 


PROBLEM 


Mary sought help in her planning for higher education, having had to give up 
her ambition to study medicine. 


TESTS 


Differential Aptitude Tests, Grade 9. 
An IQ of 111 was reported; test name and date of testing not given. 


REPORT OF COUNSELING IN GRADE 12 


Mary has a relatively good achievement record. She was undecided about
her plans for college. Her initial goal had been to become a doctor. Because of
the inability of her family to contribute financially to her college training, it
was necessary for Mary to try for a scholarship. She failed in this competition.

Mary was assisted in understanding herself a little better through discussion
of her Differential Aptitude Test results with her counselor. It is apparent from
her tests that she does not have the superior aptitudes required of students who
are awarded scholarships to prepare for medical training. She is now working
and intends, after a year, to enter either a school for laboratory technicians or
a medical secretarial school. In view of her persisting interests, her present
plans seem more realistic. The test profile suggests adequate ability for either
the technician course or the secretarial course—except that, for the former, a
higher Mechanical Reasoning score might be desirable and, for the latter, her
Clerical Speed and Accuracy and Spelling ratings are a bit low.


[EDITORS'] COMMENTS


The Differential Aptitude Test profile shows that Mary probably could enter
a not-too-demanding college and do creditable work. Her present plans, however,
seem to offer reasonable ways for her to satisfy her medical interests.
Counseling should always take into account the motive behind the original
choice of a goal; if Mary's desire to be a doctor reflected a concern for helping
people, perhaps she would be happier as a social worker connected with a
hospital or as a nurse. If her interests are primarily technical, then her present
plans are probably wiser. Merely scaling the ambitions downward is not enough;
positive suggestions for exploring alternates which are consistent with interests
and abilities are just as necessary parts of a sound counseling process.

The foregoing comments refer to Mary's vocational plan. It should also be
pointed out that a girl of Mary's abilities might find one, two, or four years of
liberal arts or general education to be a valuable experience apart from vocational
preparation. As a matter of fact, her vocational plan can be nicely
integrated with a program in general education in either junior college or in a
regular though not-too-demanding liberal arts college. Counseling problems are
so often expressed in terms of vocational needs of young people that counselors
sometimes forget the important values of education for citizenship and for
developing mature cultural interests.




Tests of the various factors of mental ability can be especially valuable 
for use with high school students who seek assistance in choosing among 
various curricula and vocations. Instead of a single IQ, the tests provide 
a profile of students’ scores on a test of several components of mental 
ability.⁹ An illustration of an aptitude test profile and its interpretation is
given in Figure 6.2.

As more research data are made available, counselors can use the results 
of such multiscore or multifactor tests with increasing effectiveness as aids 
in counseling students. Until supporting research data are available, how- 
ever, one cannot assume that high scores in verbal meaning predict success 
in linguistic pursuits or that high scores in space relations predict success in 
art and architecture. 

When Thurstone completed his pioneer studies in the factor analysis 
of mental abilities, he developed a test battery designed to measure a 
number of primary factors that appeared to be sufficiently independent to 
justify their separate measurement. He designed the Tests of Primary
Mental Abilities,¹⁰ which provide a profile of percentile ranks on six
important factors of mental abilities—number, verbal meaning, space, 
word fluency, reasoning, and memory. 

The PMA and other multiscore intelligence tests were developed largely 
on the basis of theoretical interest in the components of mental ability, 
while vocational aptitude test batteries were originally developed to meet 
practical problems of selection and classification. No clear-cut distinction, 
however, can be made between multiscore intelligence tests and aptitude 
test batteries. Both types of aptitude tests have been developed on the 
basis of research designed to identify and measure components of mental 
ability that (1) predict future achievement, (2) are relatively homogeneous,
and (3) are fairly independent of one another. In each case, all
the subtests are normed on the same standardization group, so that con- 
verted scores are comparable and a profile of the student's abilities 
can be drawn. 
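Why the subtests in a profile need to be both reliable and fairly independent can be shown with the usual classical-test-theory expression for the reliability of a difference between two scores. That expression is not given in this chapter; it is introduced here as an assumption, and it presumes that the two scores are on a common standard scale with equal variances. The reliability of .88 and the intercorrelation of .38 are the averages reported for the DAT in Table 6.4; the intercorrelation of .70 is a hypothetical value added for contrast.

```python
def difference_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference between two equally scaled scores.

    r_xx, r_yy: reliability coefficients of the two tests.
    r_xy: correlation between the two tests (must be below 1).
    """
    if r_xy >= 1:
        raise ValueError("the intercorrelation must be less than 1")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Average DAT values from Table 6.4.
print(round(difference_reliability(0.88, 0.88, 0.38), 3))

# Same reliabilities, but a hypothetical higher intercorrelation.
print(round(difference_reliability(0.88, 0.88, 0.70), 3))
```

With the reliabilities held constant, raising the correlation between two tests lowers the reliability of the difference between them, so peaks and valleys between highly correlated subtests deserve the least confidence.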

As predictive validity coefficients for success in vocations (or voca- 
tional training courses) are obtained for multiscore intelligence tests, and 
predictive validity data on success in various school subjects are obtained 


9 In analyzing a student's profile of comparative achievements or abilities, it is
important that the subtests be highly reliable and that they measure abilities that
are fairly independent of one another (as shown by low intercorrelations). The
problem of the statistical significance of differences between pairs of scores is
discussed in Chapter 3.

10 The Chicago Test of Primary Mental Abilities (Chicago: Science Research
Associates, Inc., 1941). This test was designed for ages 11-17. An abbreviated
scale based on five of these factors has been developed, as well as tests of primary
mental abilities for children of the elementary grades: SRA Primary Mental Abilities,
ages 5-7, ages 7-11, ages 11-17. For further data on these tests and references
to critical reviews, see Appendix A.


for vocational aptitude test batteries, the distinction between these two 
types of batteries becomes insignificant, despite their different historical 
origins. In fact, the publishers of the DAT, a vocational aptitude test, 
have suggested that the scores on tests VR and NA be combined as an
index of scholastic aptitude. They have also developed a junior edition of 
the four subtests of the DAT that have the greatest predictive validity in 
educational decision-making and have published them as the APT (or 
Academic Promise) tests. Today the test user can simply choose the bat- 
tery that has the greatest predictive validity for his own purposes without 
regard to the historical origins of the test. 

The abilities measured by the PMA, a leading multiscore intelligence
test, and by the DAT, a leading vocational aptitude test battery, are compared
below.


PMA¹¹                                   DAT¹²
(Chicago Test of Primary Mental         (Differential Aptitude Tests)
Abilities, Single Booklet Edition)

Number                                  Numerical ability
Verbal meaning                          Verbal reasoning
Space                                   Space relations
Word fluency
Reasoning                               Abstract reasoning
Memory
                                        Mechanical reasoning
                                        Clerical speed and accuracy
                                        Language usage


It will be noted that in four of the subtests, the two
batteries appear to measure similar abilities. The difference between these
two batteries seems to be chiefly one of emphasis. The PMA includes
tests of word fluency and memory, which probably have greater significance 
in predicting school achievement than vocational success, whereas the 
DAT subtests of mechanical reasoning, clerical speed and accuracy, and 
language usage are included primarily for their value in predicting success 
in a family or group of occupations (mechanical and clerical, respectively).
Another difference is that the PMA attempts to measure relatively
"pure" or uncorrelated mental abilities. Research studies, however, reveal
that the subtests that presume to measure factors are still somewhat
impure.¹³ Although the DAT battery claims to measure abilities that are


11 Thelma Gwinn Thurstone and L. L. Thurstone, Chicago Test of Primary Mental
Abilities (Chicago: Science Research Associates, Inc., 1941).
12 G. K. Bennett, H. G. Seashore, and A. G. Wesman, The Differential Aptitude
Tests (New York: The Psychological Corporation, 1947).
13 A. B. Crawford and P. S. Burnham, Forecasting College Achievement (New
Haven, Conn.: Yale University Press, 1946).


relatively distinct, its emphasis—particularly in the last three tests—is on 
the prediction of success in a group of occupations rather than on the 
isolation of factors of mental ability. 

These two tests also illustrate the difference between an aptitude test 
in which speed is a factor and one in which it is not. The PMA sets time 
limits that are so brief that a student’s speed of work is a significant factor 
in determining his scores. In fact, Super and Crites contend that “speed 
plays too important a part in all the tests.”¹⁴ With the exception of the
subtest on clerical speed and accuracy, the tests of the DAT battery are 
power tests. The time limits are so liberal that almost all students are able 
to complete the tests within the time limits allowed. The relative desira- 
bility of speed and power tests of aptitude is best determined by their 
relative value for a specific prediction problem. For example, a speed
test may be far superior for measuring aptitude for certain clerical func- 
tions (for example, rapid proofreading of names and numbers). On the 
other hand, a power test may be preferable for measuring more complex 
abilities, such as engineering aptitude or ability to learn foreign languages. 

New forms of the DAT, incorporating minor revisions to facilitate ad- 
ministration and scoring, were published in 1963; the new standardization 
involved 45,000 students from 192 schools in 43 states. In Table 6.4 the 
DAT has been compared with the General Aptitude Test Battery, which 
has proved to be very valuable in the counseling of high school youth 
who are not college-bound. 

Although the GATB was developed primarily for use by state employment
services, these agencies are interested in helping high school seniors
who are ready to enter the labor market. Where the GATB can be used
effectively for this purpose, cooperative plans have often been developed
on a local basis for the use of the battery in the schools.

Dvorak suggests nine steps for inaugurating a testing and counseling pro-
gram that utilizes the GATB: (1) cooperative planning between repre-
sentatives of the schools and the local employment service; (2) starting
the program early in the last school year; (3) orientation of the students
with respect to the program; (4) screening (in order to eliminate those
students who are going on to college, who have made a final vocational
choice, or who are not entering the labor market immediately); (5) ad-
ministration of the tests (usually by employment-service personnel but
sometimes by school guidance workers); (6) counseling interviews (by
employment-service personnel, and sometimes also by school personnel);
(7) transmission of records and information (with an interchange of test


14 Donald E. Super and John O. Crites, Appraising Vocational Fitness by Means
of Psychological Tests, rev. ed. (New York: Harper & Row, Publishers, Inc., 1962), 
p. 137. 




Table 6.4 
Comparison of the DAT and GATB Aptitude Test Batteries 


TEST CONTENT AND ORGANIZATION


Differential Aptitude Tests


Eight tests, four measuring aptitudes in the strict sense of the term
(VR—verbal reasoning, NA—numerical ability, SR—space relations, and
perhaps AR—abstract reasoning); two that are factorially complex tests of
aptitudes (MR—mechanical reasoning, CSA—clerical speed and accuracy)
and two that are proficiency tests with predictive value (LU I and
LU II—language usage: spelling and sentences).

Measures selected variables known to be of value in educational and
vocational counseling.

Includes tests of mechanical comprehension and language, not included
in GATB.


General Aptitude Test Battery


Nine factor scores, obtained from 12 tests, as follows:
G—general learning ability
V—verbal aptitude
N—numerical aptitude
S—spatial aptitude
P—form perception
Q—clerical perception
K—motor coordination
F—finger dexterity
M—manual dexterity

Measures most of the aptitudes that have been isolated and found to be
occupationally significant.

Includes tests of form perception, eye-hand coordination, motor speed,
finger dexterity, and manual dexterity, not included in DAT.


TEST VALIDITY


Tests of similar names evidently measure similar, but not identical, abilities, as
evidenced by the following r's between similar tests: verbal .72, space .72, numerical
.62, and clerical .53.⁴


Differential Aptitude Tests


Factor of general mental ability accounts for a substantial proportion
of variance in all tests except CSA. However, tests are reasonably
independent, the average intercorrelation between subtests being .38.

More school-oriented, designed primarily for use in high school
counseling.


General Aptitude Test Battery


As basis for construction of battery, factor-analysis studies were
conducted with 59 tests, administered in 9 overlapping batteries. On the basis
of these studies, ten factors were identified and 15 tests chosen to
measure them. In a later revision, the number of factors was reduced
to nine, the number of tests to 12. Average intercorrelation between
subtests is only .28.

More work-oriented, developed by Occupational Analysis Division, United
States Employment Service, primarily for use in vocational counseling
of applicants for employment.


202, THE STUDY OF INDIVIDUALS 


Table 6.4 (Continued)
Comparison of the DAT and GATB Aptitude Test Batteries

(DAT = Differential Aptitude Tests; GATB = General Aptitude Test Battery)

DAT: Considerable data available concerning predictive validity of tests for
grades in high school and college subjects. Has moderate differential
validity for subject fields. Very little data available on concurrent or
predictive validity for success in specific vocations or vocational training
programs.
GATB: "Most adequately standardized and validated battery now available for
vocational counseling and placement of inexperienced young persons and
adults."ᵇ Considerable validation data (for 145 jobs and 23 occupational
families). However, some studies provide only concurrent validity data.
Typical validity coefficients average .50.

DAT: VR and NA score usable as measure of general scholastic aptitude.
GATB: G score usable as measure of general scholastic aptitude.

DAT: Except for CSA, all tests are power tests, with generous time limits.
GATB: Tests are highly speeded.


TEST RELIABILITY

DAT: Tests long enough to give fairly reliable results. Average reliability
coefficient .88 (split-halves method used except for speeded CSA).
GATB: Tests shorter and somewhat less reliable, especially tests of
perception, coordination, and dexterity (for which reliability coefficients
range from .65 to .79).

DAT: High intercorrelations between certain tests reduce reliability of
differences.
GATB: Low intercorrelations between most tests result in lower standard
errors for difference scores.

DAT: Studies regarding stability of difference scores over a three-year
period indicate that differences among CSA, MR, and the over-all level of
the verbal-language-numerical tests are stable enough to be interpreted
seriously.
GATB: Retest reliabilities over a three-year period (9th to 12th grade)
almost as high as those for a three-month period.

DAT: Norms for each grade and sex (large, representative norming samples; in
all, more than 45,000 students in 192 schools in 43 states involved in 1962
norming).
GATB: Multiple norms on many occupational groups. Number of cases in each
occupational group is generally 50-200. Profiles for occupational families
based on norming samples of 60-900. Critical scores on each of the three
most crucial factors for each of 23 occupational families.

DAT: Percentile norms.
GATB: Standard scores with mean of 100 and SD of 20.




DAT: Norms by grade level indicate changes in aptitude with increasing
maturity and experience.
GATB: Conversion tables make it possible for counselors to estimate
individual's probable adult status on any test from his score as a high
school student.


USABILITY

DAT: All tests machine-scorable paper-and-pencil tests.
GATB: Eight of 12 tests machine-scorable; all quickly scored. Two tests
involve use of simple apparatus, but can be administered in groups.

DAT: Alternate forms available for all tests.
GATB: Alternate forms available for first seven tests.

DAT: Any or all tests available for administration and scoring by any school
district. All eight tests available in new two-booklet edition, used with
only two answer sheets.
GATB: Tests given only through State Employment Service to high school
students who plan to seek employment, rather than attend college.
Cooperative plan makes results available both to high school counselor and
employment service.

DAT: Spanish edition available.
GATB: Versions for use in at least 27 foreign countries have been prepared.


Source: Donald E. Super and John O. Crites, Appraising Vocational Fitness by
Means of Psychological Tests, rev. ed. (New York: Harper & Row, Publishers,
Inc., 1962), pp. 328-349; American Personnel and Guidance Association, The
Use of Multifactor Tests in Guidance (Washington, D.C.: The Association,
1957); Manual for the Differential Aptitude Tests, 3d ed. (New York: The
Psychological Corporation, 1957); and Guide to the Use of the General
Aptitude Test Battery (Washington, D.C.: Government Printing Office, 1958).

a Guide to the Use of the General Aptitude Test Battery (Washington, D.C.:
Government Printing Office, 1958), p. 1-2.

b Super and Crites, op. cit., p. 338.


data, personnel-record data, and the like between school and employment
agency) and, ideally, case conferences on a number of students; (8) training
of school personnel by state employment-service staff members in the use of
the GATB and the various types of information available on the occupational
and labor market and training of employment-service staff members by school
personnel in the interpretation of previous test scores and other
information on the school cumulative records; and (9) follow-up studies on
the vocational adjustment of students tested, developed cooperatively by the
school and employment-service personnel.¹⁵

Although these general policies have been outlined at the national level, 
it is recommended that any schools desiring to use the GATB consult 
the nearest office of their state employment service regarding the specific 
conditions under which the tests may be used.¹⁶

We shall not attempt to describe other aptitude test batteries. However, 
the subtests for each battery are listed in Appendix A. In making the dif- 
ficult but highly important choice among aptitude test batteries, the teacher 
should consult the professional literature and revised test manuals for 
reports on research studies as well as reviews in the latest Buros Yearbook. 
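
The remark in Table 6.4 that high intercorrelations reduce the reliability of differences can be illustrated with a short computation. The sketch below uses the standard classical-test-theory formula for the reliability of a difference between two equally variable scores; the text itself does not state this formula, so it is an assumption here, as is the function name. The values .88 and .38 are the DAT averages quoted in the table, and .72 is a purely illustrative higher intercorrelation.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference between two scores of equal variance.

    r_xx, r_yy: reliability coefficients of the two tests
    r_xy:       correlation between the two tests
    Standard classical-test-theory formula (an assumption; not given in the text).
    """
    if r_xy >= 1.0:
        raise ValueError("intercorrelation must be below 1.0")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Two tests with reliability .88 that correlate .38 (DAT averages from Table 6.4)
moderate_overlap = difference_reliability(0.88, 0.88, 0.38)

# The same two reliabilities with an illustrative intercorrelation of .72
high_overlap = difference_reliability(0.88, 0.88, 0.72)

print(round(moderate_overlap, 2))  # 0.81
print(round(high_overlap, 2))      # 0.57
```

Under these assumptions, the more two subtests overlap, the less dependable a difference between them becomes, even when each subtest is quite reliable on its own.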


TESTS OF SPECIAL APTITUDES 


Before aptitude test batteries were developed, the high school counselor 
who wished to do vocational aptitude testing had no choice but to admin- 
ister a number of tests of single aptitudes (frequently called special 
aptitude tests). Outstanding examples of such special aptitude tests are 
the Minnesota Clerical Test, the Revised Minnesota Paper Formboard, the 
Bennett Test of Mechanical Comprehension, the Meier Art Judgment Test, 
and the Seashore Measures of Musical Talents. Each of these tests is one 
of the best in its field. 

Although each special aptitude test may provide percentile norms, these 
percentiles are not based on the same norming population. Hence, unless 
local norms have been established, there is no basis on which the student’s 
profile of relative abilities on several special aptitude tests may be drawn. 
On an aptitude test battery, however, subtest percentile ranks are com- 
parable, since they have all been normed on the same students.
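
A small sketch may make the point about norming populations concrete. The two norm groups below are invented for illustration, as is the helper function; they stand for two special aptitude tests whose percentile norms were built on different samples.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score`, counting half of any ties."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical norming samples (not from any published manual)
general_group = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
selected_group = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]

print(percentile_rank(60, general_group))   # 65.0
print(percentile_rank(60, selected_group))  # 25.0
```

The same performance falls at the 65th percentile in one group and at the 25th in the other. A profile drawn from percentiles based on different groups would therefore reflect differences between the norming samples as much as differences in the student's abilities.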

Special aptitude tests, however, may have the advantage of having been 
normed on several occupational groups or on groups of students enrolled 
in special curricula. Therefore, a student’s score on, say, the clerical 
aptitude test can be compared with the scores of persons employed in 
routine clerical work, with those of bookkeepers, and the like. Tests of 
single aptitudes serve certain other functions, for example, to round out 
the picture of vocational strengths and weaknesses for individual students 
or to assist members of special departments, such as art and music, in 


15 Beatrice Dvorak, “Proposal for Organizing a Multi-Factor Testing Program for 
Vocational Counseling,” Conference on Using Multi-Factor Aptitude Tests in Edu- 
cational and Vocational Counseling and Prediction (Berkeley, Calif.: University of 
California, Field Service Center, 1953), pp. 43-46.


16 A. W. Motley, Assistant Director, United States Employment Service, letter to
the authors, April 29, 1955. 




selecting and guiding students within their respective fields. No subtests 
on aptitudes in the arts are included in aptitude test batteries.


Performance Tests vs. Paper-and-Pencil Tests 


Although aptitude test batteries are almost exclusively of the
paper-and-pencil type, special aptitude tests may be of the performance
type. Each has its special advantages and limitations.

In the strict sense of the word, performance tests require manipulative 
skills and involve the actual use of apparatus or materials. As such, they 
may be more concrete and meaningful to students and employees than 
pencil-and-paper tests, for the problems included in such tests closely 
resemble those involved in employment or training situations. 

Performance tests have greater appeal, and therefore probably greater
validity, for the less academic students. Paper-and-pencil tests are
relatively abstract and require a higher degree of inductive reasoning on
the part of the subject. Obviously, paper-and-pencil tests are less
expensive than performance tests, for they can usually be administered in
groups and can be scored quickly; consequently, they may be the only
feasible choice in a school testing program. However, measurement of such
abilities as manual dexterity, filing ability, and the like by means of
paper-and-pencil tests is necessarily indirect, and the validity of such
tests must be established through statistical studies of their ability to
predict successful performance.
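
The kind of statistical study mentioned here usually ends in a validity coefficient, that is, a correlation between test scores and a later criterion of success. The following is a minimal sketch of that computation; the scores and supervisor ratings are invented for illustration and the function name is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between paired test scores and criterion values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: paper-and-pencil test scores at hiring,
# and supervisors' ratings of job performance six months later
test_scores = [12, 15, 17, 20, 22, 25, 28, 30]
job_ratings = [2, 3, 2, 4, 3, 5, 4, 5]

validity = pearson_r(test_scores, job_ratings)
print(round(validity, 2))
```

With only eight cases such a coefficient would be far too unstable to interpret; actual validation studies rest on much larger groups.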

Figure 6.3 and the accompanying Table 6.5 offer a direct comparison
between a performance test (the Minnesota Spatial Relations Test) and
its paper-and-pencil "equivalent" (the Minnesota Paper Form Board).
Careful study of these data indicates that the two tests are by no means
identical. The performance test may be more useful in determining whether
individuals desiring to enter certain semiskilled or skilled occupations have
the required capacity for spatial visualization or manual dexterity. The
paper-and-pencil test can be used for testing large numbers of students
who wish to appraise their spatial judgment as one factor to be considered
in their vocational planning.


Tests of Manual Dexterity 


If the General Aptitude Test Battery (GATB) is given to high school
juniors or seniors intending to enter employment, the performance tests in
this battery will provide evidence concerning the student's motor
coordination and his dexterity in finger and hand movements. If the GATB is
not given, however, a counselor may wish to administer a test of manual




Table 6.5
Comparison of a Performance Test and a Pencil-and-Paper Test
in the Same Area

(Performance test: Minnesota Spatial Relations Test; pencil-and-paper test:
Minnesota Paper Form Board)


ADMINISTRATION

Spatial Relations Test: Administered individually.
Paper Form Board: May be administered in groups.


SCORING

Spatial Relations Test: Number of seconds required to complete last three
form boards.
Paper Form Board: Number of correct responses in 20 minutes. May be machine
scored.


CONTENT

Spatial Relations Test: Four form boards, three feet long by one foot wide.
Shapes include crescents, squares, and odd-shaped geometrical forms.
Paper Form Board: Sixty-four items. For each item, the "stem" consists of
two to five disarranged parts of a geometric figure. The student chooses one
of five responses (assembled geometric figures). The problem in each case is
to select the figure that can be assembled from the parts. To make the
appropriate figure, it may be necessary merely to push the parts together,
or to turn them around or even over. All matching of shapes must be done
mentally; no trial-and-error work is possible.


AVERAGE TIME

Spatial Relations Test: 20 to 25 minutes.
Paper Form Board: Twenty minutes after practice problems.


FACTORS MEASURED

Spatial Relations Test: Ability to visualize and judge spatial relations;
ability to perceive spatial differences; reasoning.
Paper Form Board: Ability to visualize and judge spatial relations;
perceptual ability; inductive reasoning. The reasoning factor is more
heavily weighted than in performance test. Scores are mixed speed-and-level
scores, with level of difficulty playing a lesser part.


Source: Data summarized from Donald E. Super and John O. Crites, Appraising
Vocational Fitness by Means of Psychological Tests, rev. ed. (New York:
Harper & Row, Publishers, Inc., 1962), pp. 281-300.


Fig. 6.3 A Performance Test (above, Minnesota Spatial Relations Test)
and a Paper-and-Pencil Test (below, Minnesota Paper Form Board Test)
of Spatial Relations




dexterity to a student for whom such information seems significant. For
example, a student who is interested in dentistry and has all the academic
qualifications might wonder if he has sufficient manual dexterity for such
highly skilled work. Since such tests must usually be administered
individually by someone who is specially trained, manual dexterity tests
should ordinarily be reserved for the study of students with special
problems in vocational counseling. Low scores must be interpreted in light
of research concerning the trainability of certain skills.

Super and Crites have reviewed a number of validation and norming 
studies on the Minnesota Rate of Manipulation Test, a test of gross
arm-hand dexterity.¹⁷ On the basis of their review, they doubt whether tests of
gross manual dexterity have any value in the counseling of high school 
students. Their value lies chiefly in the selection or preemployment counsel- 
ing of persons seeking such jobs as packing or large-part assembly. 

Tests of finer dexterity, such as the O'Connor Finger and Tweezer
Dexterity Tests,¹⁸ may be of value in counseling high school students desiring
work in assembling small parts or in professions such as dentistry, which 
require fine manual dexterity. 

The Purdue Pegboard measures both arm-hand dexterity (of a finer 
type than the Minnesota test) and finger dexterity (in a more realistic 
situation than the O’Connor tests). The fact that the student’s ability to 
coordinate the action of both hands and to eliminate nonessential opera- 
tions affects his scores may make this a more valid instrument for use in 
predicting success in certain occupations. Although high school norms
are not now available, the early maturation of manual dexterity may make
the adult norms applicable.


Tests of Clerical Aptitudes 


Since several aptitude test batteries include tests of clerical aptitude, 
schools make limited use of special aptitude tests in this field. 

Two widely used tests illustrate different approaches to the measure- 
ment of clerical aptitude. The Minnesota Clerical Test is a highly speeded, 
homogeneous test of clerical speed and accuracy, similar to those included 
in many aptitude test batteries. The General Clerical Test is a much more 
comprehensive test, measuring three types of ability important in office 
work: clerical speed and accuracy, numerical ability, and verbal facility. 


Obviously, the critical scores required in these three tests would vary 
considerably for different types of office jobs. 


17 Super and Crites, op. cit., pp. 182-217.


18 Johnson O'Connor, O'Connor Finger and Tweezer Dexterity Tests (Chicago: 
C. H. Stoelting, 1928); reviewed in Super and Crites, op. cit., pp. 200-213. 




Aptitude Tests in Art


Of special interest are the aptitude tests in the fine arts, because
aptitude test batteries do not attempt to measure the abilities requisite
for high-level performance in these fields.

The Meier Art Tests: 1, Art Judgment,¹⁹ designed for use in grades 7-12,
requires the student to select the more artistic picture in each of one
hundred pairs of pictures. The test may be administered individually or to
a group; the scoring is simple and objective. The test seems to measure
a refined type of aesthetic judgment that matures during the secondary
school and adult years.²⁰

A new Aesthetic Perception test has been completed by Meier and his 
associates; the final steps in the item validation and standardization of 
this test were completed in 1963. According to Meier, the new test “differs 
from test I... in the degree of penetration involved: the subject must 
evaluate four versions of each art product in order to rank them in order 
of aesthetic character."²¹

The Graves' Design Judgment Test²² has been designed to avoid the use
of representational art. All of the designs are abstract. Each exercise of
this test consists of two or three designs, one of which is organized
according to the fundamental principles of art structure, the other design
or designs violating one or more such principles. Norms are available for
high school and college students. The test may be administered in 20 to 30
minutes and may be machine scored.


Aptitude Tests in Music 


One of the earliest tests of special aptitude was the Seashore Measures
of Musical Talents, developed in 1919. After the tests had been used for
20 years and extensive research on their use had accumulated, a revised
edition was issued in 1939. The tests are now available on a single
long-playing record and are adapted for machine scoring.²³ The following six
elements are measured: (1) pitch, (2) loudness, (3) time, (4) timbre, 
(5) rhythm, and (6) tonal memory. In the first test (sense of pitch), for 
example, a series of paired sounds is played. The student is required to in- 
dicate for each pair whether the second sound is higher or lower in pitch 
than the first. Norms are available for grades 5 and 6, grades 7 and 8, 


19 Norman C. Meier, Meier Art Tests: 1, Art Judgment (Iowa City, Iowa: Bureau
of Educational Research and Service, University of Iowa, 1940).

20 Super and Crites, op. cit., p. 309.

21 Letter to the author from Dr. Norman C. Meier, May 11, 1963.

22 Maitland Graves, Design Judgment Test (New York: The Psychological
Corporation, 1948).

23 Carl E. Seashore, Don Lewis, and Joseph G. Saetveit, Seashore Measures of
Musical Talents, rev. ed. (New York: The Psychological Corporation, 1960).




and adults. Research studies indicate that special training in music does not 
influence test scores. 

The Musical Aptitude Test, developed by Whistler and Thorpe,²⁴ con-
sists of short exercises played on a piano. The test includes subtests in (1) 
rhythm recognition, (2) pitch recognition, (3) melody recognition, (4) 
pitch discrimination, and (5) advanced rhythm recognition. Students’ re- 
sponses are recorded on answer sheets and can be machine scored. Percen- 
tile norms for grades 4 and above are provided. 

The Drake Musical Aptitude Test²⁵ measures two significant music apti-
tudes—musical memory and rhythm. Two equivalent forms of the test are 
available on a single long-playing record. The test can be administered to 
the least talented child and yet is difficult enough for the gifted adult. 
Results do not seem to be affected by musical training. 

The Wing Standardized Tests of Musical Intelligence,²⁶ developed in
England, include seven tests that cover chord analysis, pitch change,
memory, rhythmic accent, harmony, intensity, and phrasing. The first three
parts require students to make sensory discriminations, but at a more
complex level than the Seashore tests, while the other four require students
to compare the aesthetic merits of pairs of selections of piano music.
Norms are provided on total scores. The tests have a sufficiently high
ceiling to differentiate among talented individuals. Validity studies with
small groups have resulted in correlations of .60-.70 with teachers' ratings
of musical ability.


PROGNOSTIC TESTS 


The last type of aptitude tests we will consider illustrates the use of the 
term "aptitude" in its first sense, that is, aptitude for learning in some 
subject field. As our initial discussion suggested, these tests may be 
omnibus tests. They may include subtests on abilities considered necessary 
in the subject field, but no attempt has usually been made to have these 
subtests measure aptitudes in the sense of discrete, unitary abilities. 


READING-READINESS TESTS Perhaps the most widely used tests in this 
group are the reading-readiness tests. Typically, they measure the pupil's 
knowledge of oral vocabulary, his ability to follow oral directions, to match 


24 Harvey S. Whistler and Louis P. Thorpe, Musical Aptitude Test, Series A 
(Monterey, Calif.: California Test Bureau, 1950). 

25 Raleigh M. Drake, Drake Musical Aptitude Tests (Chicago: Science Research 
Associates, Inc., 1954). 

26 Wing Standardized Tests of Musical Intelligence (London: National Founda- 
tion for Educational Research in England and Wales, 1939). 




visual stimuli, to match sounds as in rhyming, and the like. A "job analysis" 
of beginning reading has led to hypotheses concerning types of items that 
might have predictive validity. Those items that have actually shown pre- 
dictive validity in several different situations have been retained. A well- 
designed reading-readiness test does seem to have higher predictive validity 
than a test of general mental ability. An additional advantage is that the 
results of reading-readiness tests are more easily discussed with parents 
and do not lead to premature generalizations about the child's general in- 
telligence. 

The Gates Reading Readiness Tests and the Metropolitan Readiness 
Tests are probably the most widely used group tests of reading readiness. 
The latter includes subtests on readiness for early work in arithmetic. Sev- 
eral other tests, together with information on reviews, are listed in Ap- 
pendix A. In interpreting the results of these readiness tests, the teacher 
should consider the test scores in combination with other data indicative
of the child's readiness for first-grade learning experiences.


PROGNOSTIC TESTS IN FOREIGN LANGUAGE The renewed interest in the 
teaching of foreign language has led to the development of the first prog- 
nostic test in foreign language published in many years. In 1958, Carroll 
and Sapon published their Modern Language Aptitude Test. Based on a 
five-year research study conducted at Harvard University, this test has been 
shown to have predictive validity for the student's progress in learning
almost any modern or classical language. Since the test is only moderately
correlated with intelligence, it provides "new information" to the counselor
and may be used in combination with intelligence test data in developing 
two-variable expectancy charts from local data. 
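
A two-variable expectancy chart of the kind mentioned above simply tabulates, for each combination of score levels on two predictors, the percentage of past students who succeeded. The sketch below is one possible layout and not a procedure prescribed by the test authors; the records, the cutting scores, and the definition of success (a course grade of C or better) are all hypothetical.

```python
from collections import defaultdict


def expectancy_chart(records, cut_a, cut_b):
    """Percent succeeding in each cell of a 2 x 2 chart.

    records: iterable of (score_a, score_b, succeeded) from local follow-up data
    cut_a, cut_b: cutting scores that split each predictor into "high" and "low"
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, total cases]
    for score_a, score_b, succeeded in records:
        cell = (
            "high" if score_a >= cut_a else "low",
            "high" if score_b >= cut_b else "low",
        )
        counts[cell][1] += 1
        if succeeded:
            counts[cell][0] += 1
    return {cell: 100.0 * wins / total for cell, (wins, total) in counts.items()}


# Hypothetical local data: (language aptitude score, IQ, earned C or better)
local_records = [
    (60, 115, True), (62, 98, True), (58, 120, True), (55, 95, False),
    (40, 118, True), (38, 112, False), (35, 96, False), (30, 90, False),
    (65, 105, True), (42, 100, False),
]

chart = expectancy_chart(local_records, cut_a=50, cut_b=110)
for cell in sorted(chart):
    print(cell, round(chart[cell], 1))
```

Ten cases are used here only to keep the example readable. A chart intended for counseling would need enough students in every cell for the percentages to be stable.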


PROGNOSTIC TESTS IN MATHEMATICS Two different approaches have been
used in prognostic testing in mathematics. Some tests, such as the
Orleans Algebra Prognosis Test, measure the student's speed and accuracy
in learning material similar to that which he will encounter in the course.
Other tests, such as the Iowa Algebra Aptitude Test, constitute an
inventory of the student's achievement in the underlying skills (that is,
arithmetic computation, manipulation of numerical series, computations
involving abstract concepts, and solution of problems involving dependence
and variation). Since multiscore tests of general mental ability and
vocational-aptitude test batteries usually include a subtest on numerical
ability, the use of special prognostic tests in mathematics may decline.


PROGNOSTIC TESTS IN SHORTHAND Prognostic tests have been used to 
considerable advantage in measuring aptitude for learning shorthand skills. 
The need for predicting success in shorthand is perhaps greater than the 




need in any other secondary school subject, because (1) the mortality in 
shorthand courses is very high, (2) shorthand skills have limited value for 
students who do not use them vocationally and soon deteriorate if they 
are not maintained through regular practice, (3) many students with low 
general intelligence need to be guided away from a stenographic goal to 
a more suitable vocational choice, and (4) students’ cumulative records 
contain little relevant information. 

Tests of shorthand aptitude are listed in Appendix A. The Turse and 
the ERC tests seem to have high face validity as well as predictive validity. 
That is, students can more readily accept their test scores as relevant be- 
cause of the obvious similarity between the abilities required in the test 
exercises and those required in shorthand. 


PURPOSES FOR WHICH APTITUDE TESTS ARE USED 


The preceding chapter sections on types of aptitude tests have revealed 
that a bewildering variety of aptitude tests are available. In fact, many 
measurement textbooks have three or more chapters on aptitude testing, 
with separate chapters being devoted to (1) individual tests of general 
mental ability, (2) group tests of general mental ability, (3) aptitude test 
batteries, and (4) tests of special abilities. The reason that we have 
grouped all these tests in a single chapter is to emphasize their common 
functions and to indicate the responsibility of the test user to make dis- 
criminating choices among them in terms of his purposes in testing. 
Moreover, we believe that the current trend toward interpreting test data
to parents and students may result in a trend away from the routine
administration of tests designed to measure the construct of "general
intelligence" and toward the use of a variety of aptitude tests, each being
selected as most appropriate to the school's current purposes in testing.
For example, a school staff might decide that:


1. A readiness test would help best in grouping and diagnosis at the first-grade
level and minimize risks of premature generalizations about a child's
intelligence.

2. Two tests of general mental ability should be administered during the
elementary school years, with the position of the test on the spectrum of
ability tests (Fig. 5.1) being affected by the characteristics of the
student population and by whether or not achievement tests are routinely
administered.²⁷


27 If achievement tests were not routinely administered, a test emphasizing school-
developed abilities such as SCAT might be desirable. On the other hand, if such 
achievement tests are routinely given, a test that emphasizes novel test situations 
and abilities that are largely learned in nonschool Situations might provide more 
new information and help to identify children with unrecognized potential. 




3. At the junior high school level a test that provides separate scores in verbal 
and numerical abilities might be helpful in early decisions regarding elective 
courses. 

4. At the senior high school level an aptitude test battery with subtest scores 
related to future college achievement, as well as success in various occupa- 
tional groups, would be more adequate and more easily discussed with stu- 
dents and parents than a test of general mental ability. 

5. Teachers of art, music, and industrial arts might be encouraged to use special
ability tests as an aid to students in self-appraisal.

6. Prognostic tests in foreign language might be used to advantage in counsel- 
ing students regarding the advisability of taking regular or accelerated 
courses; while prognostic tests in shorthand might be administered to students 
beginning the commercial curriculum to help them estimate the prob- 
ability of their achieving an adequate proficiency level in shorthand. 


As we have already emphasized, the elementary schools are usually 
concerned with obtaining measures of general mental ability for each pupil, 
rather than profiles of mental abilities. If the school is concerned only with 
making the best assessment of the level of academic work of which a stu- 
dent is capable, for example, as one basis for grouping students, a good 
omnibus test of items predictive of academic achievement will do the job 
most efficiently. However, such a test will add little in the way of "new
information” for the large number of students who are making normal 
progress in school work. If a test is chosen that includes subtests on non- 
verbal items, the predictive validity of the test for scholastic achievement 
will be lower than if the same amount of testing time had been devoted to 
items involving verbal abilities. However, new information not provided 
by achievement tests will be obtained, and teachers may be able to identify 
students who have a higher level of learning ability than was evident from 
a verbal test alone. In a school with a large number of children from
bilingual and/or underprivileged homes, routine use of a test providing
verbal and nonverbal IQ's is advisable. In other schools, it may be
advantageous to use verbal tests with their greater predictive efficiency,
but to provide supplementary testing with a nonverbal test for students who
are retarded in their linguistic development.

High school students must make decisions concerning the advisability of 
attending different types of postsecondary schools, on the types of high 
school curricula best suited to their abilities and goals, and on the
advisability of enrollment in certain specialized courses, such as shorthand.
In order to make these and other decisions wisely, the student needs the
most adequate data he can obtain concerning his ability to progress satis- 
factorily in such new experiences. The use of adequate tests may help to 
avoid costly trial-and-error experimentation and the psychological conse- 
quences of failure. 

As the student reaches the eleventh or twelfth grade, profiles of aptitude 




test data become especially valuable. Here the student faces more specific 
choices, of vocation or college major; and the need for reappraisal is 
indicated. Aptitude- and interest-test data obtained at this grade level have 
relatively high reliability, can be more meaningfully interpreted, and have 
greater significance for the more mature students. 


INTERPRETATION OF RESULTS FROM APTITUDE TESTS 
Interpretation of Results from Scholastic Aptitude Tests 


The results of scholastic aptitude tests have usually been interpreted in 
terms of MA's and IQ's. A child's MA on a specific test of scholastic
aptitude is the average age of children in the norming sample who did as 
well on the test as he did. 

If a six-year-old does as well on a scholastic aptitude test as the average 
eight-year-old in the norming sample, his MA is 8.0. The problems in- 
volved in interpreting age scores are considered in Chapter 2. If the test is 
quite a difficult one that contains many items suitable for eight-year-olds, 
we may be able to infer that this child has a level of mental development 
typical of the average eight-year-old. Certainly we know that he will be 
able to handle more advanced types of learning experiences, and will prob- 
ably progress more rapidly, than the typical first-grader. Obviously, how- 
ever, he has not developed many of the concepts and skills needed to do 
third-grade work (the grade in which most eight-year-olds are enrolled). 
The MA, however, despite its limitations, is the best single index of
readiness for different levels of difficulty in intellectual activities.

A child's IQ, which is a measure of his rate of mental development, was
traditionally obtained by the following ratio formula:

$$\mathrm{IQ} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100$$

In this case, we would obtain:

$$\mathrm{IQ} = \frac{8}{6} \times 100 = 133$$


This six-year-old appears to be developing at the rate of one and
one-third years of growth in scholastic aptitude for each year of
chronological age. If we compare his MA's in successive elementary school
grades with those for another child with an IQ of 100, we will find that the
difference between their mental ages will become increasingly large. If their
IQ's on successive tests remain the same, the difference in MA by the time
they reach the seventh grade at age 12 will be four years, rather than two
years, as at present. Since the child with an IQ of 133 has a much higher
rate of development than the average child, the disparity between their
levels of scholastic aptitude, or expected achievement in intellectual tasks,
will increase.
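
The arithmetic behind this widening gap can be sketched in a few lines of Python. This is only an illustration of the ratio formula given above; the function names are chosen here for convenience and do not come from any test manual, and the sketch assumes, as the text does, that each child's ratio IQ stays constant.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return mental_age / chronological_age * 100


def implied_mental_age(iq: float, chronological_age: float) -> float:
    """Mental age implied at a given age if the ratio IQ stays constant."""
    return iq * chronological_age / 100


bright_iq = ratio_iq(8, 6)   # the six-year-old with an MA of 8.0 (about 133)
average_iq = 100.0           # the comparison child

# The difference in mental age grows as both children grow older.
for age in (6, 9, 12):
    gap = implied_mental_age(bright_iq, age) - implied_mental_age(average_iq, age)
    print(f"CA {age}: MA difference = {gap:.1f} years")
```

Under these assumptions the difference is two years at age 6 and four years at age 12, which matches the figures in the paragraph above.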

If two children each had an MA of 6.0 when they entered the first
grade, we cannot assume that each of them has an MA of 7.0 a year later
when they enter the second grade. The child with a high IQ (or rate of
mental development) will have grown more than a year in mental age,
while the dull child will have grown less. Hence, unless a test has been
very recently given, the MA must be estimated by the following version
of the IQ formula:


$$\mathrm{MA} = \frac{\mathrm{IQ} \times \mathrm{CA}}{100}$$

For the child we have been considering with an IQ of 133, the computation
for age seven would be:

$$\mathrm{MA} = \frac{133 \times 7}{100} = 9.3$$

Ordinarily 84 months would have been substituted for the age, so that the 
MA would be found in months, rather than decimal fractions of a year. 
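
The same estimate, worked in months as the text suggests, might look like the following sketch. The helper names are hypothetical, and rounding to the nearest whole month is an assumption made here for readability rather than a rule stated in the text.

```python
def estimated_ma_months(iq: float, ca_months: float) -> float:
    """Estimate mental age in months from a ratio IQ and chronological age in months."""
    return iq * ca_months / 100


def as_years_and_months(total_months: float) -> tuple:
    """Convert months to (years, months), rounded to the nearest whole month."""
    return divmod(round(total_months), 12)


ma = estimated_ma_months(133, 84)          # age seven expressed as 84 months
years, months = as_years_and_months(ma)
print(f"Estimated MA: {ma:.1f} months, or about {years} years {months} months")
```

For an IQ of 133 at 84 months this gives roughly 112 months, that is, about 9 years 4 months, which agrees with the decimal figure of 9.3 years.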
DISADVANTAGES OF THE RATIO METHOD OF COMPUTING IQ'S As we
have explained, the IQ has traditionally been obtained by a ratio formula.
If the student refers back to the different types of number systems in
Chapter 2, page 63, he will find that when we divide one number by another,
we assume that the measurement units are equal and that the scale has a
meaningful zero point. Reference to the section on age norms will remind
the student that the mental-age unit does not represent equal values at all
ages. It is gratifying, therefore, to note that as mental ability tests are
revised, an increasing number of them are abandoning the ratio IQ.
The ratio method also introduced difficulties into the computation of
intelligence quotients for unusually bright or unusually dull children. For
example, Peter, a 10-year-old pupil in the fifth grade, does as well as the
average 15-year-old pupil in the norming population on a fifth-grade
intelligence test. His MA would, therefore, be 15 and his ratio IQ would be
150. It is highly doubtful, however, that Peter would achieve as well as
the average 15-year-old pupil on a test designed for this higher level of
maturity. The use of the ratio method thus involves assumptions that are
not valid for students of high school age or for those pupils in the
elementary schools who are extremely bright or extremely dull. The deviation
IQ, a normalized standard score, is now becoming more widely used. Since
the 1960 revision of the Stanford-Binet has changed to the deviation IQ,
the three leading individual tests all use this method of obtaining IQ's.
According to this procedure, the score earned by each student on an 
intelligence test is simply compared with the scores of other students of his 
own age. His position is ascertained in a normal distribution for his own 
age group, and that position (actually a standard score) is translated into 
an intelligence quotient. According to this plan, Peter's score would be 
compared not with that of the average 15-year-old but with those of all 
10-year-olds in the norming population. If he excelled 98 percent of his 
own age group, his IQ would be 130. This method of computing deviation
IQ's is used in the Pintner and Otis series of intelligence tests and in the
Terman-McNemar Group Test of Mental Ability. The relationship of
deviation IQ's to standard scores and percentile ranks is clarified in
Appendix C.
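
A minimal sketch of the deviation-IQ idea follows. It assumes a mean of 100 and a standard deviation of 15 IQ points; the text does not state the standard deviation, and individual tests have used somewhat different values, so the constant here is an assumption. The function names are likewise chosen only for this illustration.

```python
from statistics import NormalDist


def deviation_iq_from_percentile(percent_excelled: float,
                                 mean: float = 100.0,
                                 sd: float = 15.0) -> float:
    """Convert the percent of the age group excelled into a deviation IQ.

    The percentile is first turned into a normal-curve standard score (z),
    which is then rescaled to the assumed IQ mean and standard deviation.
    """
    z = NormalDist().inv_cdf(percent_excelled / 100)
    return mean + sd * z


def deviation_iq_from_score(raw_score: float, age_group_mean: float,
                            age_group_sd: float,
                            mean: float = 100.0, sd: float = 15.0) -> float:
    """Same idea, starting from a raw score and the age group's own statistics."""
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z


# Peter excels 98 percent of the 10-year-olds in the norming population.
print(round(deviation_iq_from_percentile(98)))
```

With the assumed standard deviation of 15, excelling 98 percent of one's age group corresponds to a deviation IQ of about 131, close to the 130 quoted in the text.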


CONSTANCY OF THE IQ When the IQ was defined as rate of mental 
growth, the assumption was implicit that a measurable rate of growth did 
exist for each child, and that it was reasonably constant. Such an assump- 
tion, however, did not carry the implication that IQ's obtained from differ- 
ent tests and in different circumstances would agree perfectly. Not only do 
the samplings of intellectual abilities differ from test to test, but many other 
factors contribute to variability in test results. 

It is well to recognize that an IQ is derived from the score obtained on a
single intelligence test. It is not a direct measure of an individual's rate of
mental growth but a converted score, based on comparing one measure
of the student's performance on intellectual tasks with the performance of
students in the norming sample on these same tasks. One cannot infer that,
if the IQ obtained on a later intelligence test is higher, the individual has
become brighter since the time of his previous testing.

The concepts of error variance and the standard error of measurement 
must be taken into account in any interpretation of intelligence test results. 
Investigations show that the average change in IQ on repeated individual
intelligence tests is approximately 5 points, that 20 percent of IQ's change
10 or more points, and that approximately 1 percent change 20 or more
points. However, if the tests are group tests, and especially if the test used
at the later testing is not the same as that used earlier, the differences may
be even greater.²⁸
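
The standard error of measurement mentioned above can be illustrated with the classical formula SEM = SD × √(1 − reliability). The sketch below is not taken from the text; the standard deviation of 15 and the reliability of .90 are assumed values used only to show the size of the band that surrounds a single obtained IQ.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(observed: float, sd: float, reliability: float, z: float = 1.0):
    """Band of plus or minus z standard errors around an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem


# Assumed values: SD of 15 IQ points and a reliability coefficient of .90.
sem = standard_error_of_measurement(15, 0.90)
low, high = score_band(120, 15, 0.90)
print(f"SEM = {sem:.1f} points; an obtained IQ of 120 spans about {low:.0f} to {high:.0f}")
```

Under these assumed values the standard error is a little under 5 points, so differences of a few points between two testings need not reflect any real change in the student.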


28 Florence L. Goodenough, "New Evidences on Environmental Influence on 
Intelligence," Intelligence: Its Nature and Nurture, 39th Yearbook, Part I (Chicago: 
National Society for the Study of Education, 1940), p. 358. 




THE ASSUMPTION OF EQUAL OPPORTUNITY TO LEARN A student's
intelligence or scholastic aptitude is estimated by obtaining a sampling of his
behavior in a test situation in which the items have been especially selected 
to reveal his learning ability. In selecting items for intelligence tests, authors 
try to include items that are equally unfamiliar to students (described as 
novel situations) or equally familiar to students (in the sense that all stu- 
dents have had “equal opportunity to learn" the material included). 

Allison Davis has challenged the assumption that typical intelligence 
tests include materials that students of different socioeconomic classes have 
had equal opportunity to learn. He contends that students of the middle 
and upper classes have had greater opportunity and certainly greater moti- 
vation to extend their vocabulary, to learn abstract symbols, and to study 
reasoning problems in arithmetic than have students of the lower
socioeconomic classes.²⁹ It is certainly evident that such a word as "sonata" in
the vocabulary section of an intelligence test would be known by relatively 
more students from the higher socioeconomic classes than by those from 
the lower classes. Certainly an attempt should be made to include test 
items that do not give special advantage to children reared in the upper or 
middle classes, or to children reared in urban vs. rural environments. 

If one is concerned only with the intelligence test as a measure of the
child's readiness for learning activities, and as a predictor of success in
schoolwork, he can ignore such criticisms as Davis has made. That is, he
can say that the child whose culture has limited his vocabulary
development will obtain a somewhat lower scholastic aptitude score but will also
tend to score lower on the criterion of success in school. Certainly if we
omit from scholastic aptitude tests all items that discriminate against the
culturally disadvantaged child, our test will be a less adequate predictor
of school achievement. Research on the SAT discussed in Chapter 5 led to the
conclusion that the unfairness lay not in the test, but in the culture. In
other words, we should be cautious about the inferences we make from
intelligence tests, limiting our inferences to those concerning the child's
probable success in schoolwork, rather than overgeneralizing about his
general intelligence or his innate ability.

Davis and Eells attempted to develop a series of intelligence tests for
the elementary grades that would minimize such cultural bias. Anastasi
concluded, on the basis of a summary of 30 research studies, that Davis
and Eells, in their attempt to develop a culture-fair test, had sacrificed
predictive validity without eliminating cultural bias. That is, predictive
validity coefficients with both achievement tests and teachers' ratings were
uniformly lower than for conventional intelligence tests. This loss in
predictive validity might have been justifiable if the authors had achieved
their goal of developing a test that would more adequately measure the
learning ability of lower-class children. However, several studies have
revealed that lower-class children perform as poorly on the Davis-Eells tests
as on conventional intelligence tests.³⁰

In studying the intelligence of people from quite different cultures (as in
cross-cultural research studies), tests that rank on the lower end of the
continuum on verbal-educational loading should be used (see Fig. 5.1).
Such tests, however, will tend to have low predictive validity for most
criteria of success in educational or vocational activities. For the student
who wishes to study the problem of cultural bias in intelligence testing, a
number of references are listed in the first section of the Selected References
for this chapter.


29 Allison Davis and Kenneth Eells, Davis-Eells Test of General Intelligence or
Problem-Solving Ability (New York: Harcourt, Brace & World, Inc., 1953). For
research data on this problem, see Allison Davis and Others, Intelligence and
Cultural Differences (Chicago: University of Chicago Press, 1951).


VARIATIONS IN TEST CONTENT Scholastic aptitude tests can sample
evidences of intelligent behavior only as they are revealed in student
responses to the items selected for each specific test. Content varies
considerably from test to test. Some tests place a greater premium on reading
ability and vocabulary; others involve more items on numerical skills and 
understandings; while still other tests include a large percentage of items 
requiring perceptual speed and memory. A student's relative rank in his age 
group, and hence his IQ, will vary as a test includes more or less of the 
types of intellectual tasks on which he does best. Moreover, the types of 
test items that can be included in an intelligence test for kindergarten and 
first-grade children obviously differ from those that can be included in a 
test for older students, in which reasoning, judgment, and other higher 
mental processes can be more adequately tested. 

A student’s IQ should always be interpreted as his converted score on a 
specific intelligence test given at a specific time. That is, instead of saying 
that Jerry's IQ is 120, one should say that Jerry has an IQ of 120 on a
specific test (for example, the Pintner General Ability Test, Verbal Series, 
Intermediate, taken in the eighth grade). If Jerry obtains a lower IQ on a 
test of performance-type items or a higher IQ on a test placing a great 
premium on speed, the results are not necessarily inconsistent or contra- 
dictory. 


OTHER REASONS FOR VARIABILITY IN INDIVIDUAL IQS Although differ- 
ences in the content of intelligence tests constitute one of the major causes, 
there are several other reasons for the variability in results:

1. The use of tests that have low reliability.

2. Changes in the individual from one testing occasion to another, for example,
changes in the subject's physical condition or emotional adjustment,
including his attitude toward the test situation.

3. Unreliability of test results for young children because of variations in their
motivation, attentiveness when directions are given, and the like.

4. The ratio method of computing the intelligence quotient (as explained
above).

5. Differences between the norming samples on which tests are standardized.


30 Anne Anastasi, Psychological Testing (New York: The Macmillan Company,
1961), p. 267.


Although many test publishers are now taking great care in selecting 
their norming samples, norming problems are still a factor in the variabil- 
ity of intelligence quotients. An illustrative study on this last factor in test 
variability indicated that the Otis Quick-Scoring Mental Ability Tests were 
standardized on a norming population that was unusually homogeneous, 
with an SD of only 10 IQ points as compared with 13 for the Pintner. 
Such differences in SD result in these tests giving almost identical IQ's
when the results are near 100 but increasingly divergent results as the IQ's
differ from the mean. For example, an Otis IQ of 100 is equal to a Pintner
IQ of 99; however an Otis IQ of 120 is equal to a Pintner IQ of 123; and
an Otis IQ of 135 is equal to a Pintner IQ of 140.³¹
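
Why a difference in standard deviations produces little disagreement near 100 and growing disagreement toward the extremes can be seen from a simple standard-score conversion. The sketch below is an illustration only: it assumes a mean of exactly 100 on both tests and uses the two SD's quoted above. The equivalents cited in the text come from the empirical comparison study, not from this formula, and they are somewhat smaller than the values this linear conversion yields.

```python
def equate_by_standard_score(score: float, from_mean: float, from_sd: float,
                             to_mean: float, to_sd: float) -> float:
    """Re-express a score on another test's scale at the same standard score (z)."""
    z = (score - from_mean) / from_sd
    return to_mean + z * to_sd


# Assumed norms: mean 100 on both tests; SD 10 (Otis) and SD 13 (Pintner).
OTIS_MEAN, OTIS_SD = 100.0, 10.0
PINTNER_MEAN, PINTNER_SD = 100.0, 13.0

for otis_iq in (100, 120, 135):
    pintner_iq = equate_by_standard_score(
        otis_iq, OTIS_MEAN, OTIS_SD, PINTNER_MEAN, PINTNER_SD
    )
    print(f"Otis {otis_iq} -> Pintner {pintner_iq:.1f} "
          f"(difference {pintner_iq - otis_iq:+.1f})")
```

The conversion shows the direction of the effect: no difference at the mean and a steadily larger one as scores move away from it. For actual interpretation, published equivalence tables based on testing the same students should be used.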


Interpretation of Results from Aptitude Test Batteries 


The interpretation of aptitude test scores in counseling interviews is
considered in Chapter 17. There we consider such significant problems as the
probable effects of the interpretation of test results on the student's
concept of self and his aspirations for the future. The following section is
concerned chiefly with those problems of interpretation that grow out of
the limitations of the tests themselves.


THE NEED FOR CONSIDERING OTHER EVIDENCE Perhaps the most important
single principle in aptitude test interpretation is that the test results
must be considered in the context of all other information on the
student's abilities (available in school records or supplied by students and
parents). As has been mentioned, achievement test results are good
indicators of aptitude, provided that the student has had typical opportunities
for learning and motivation to learn. The student's pattern of marks in
various subject fields may provide useful clues as to both aptitudes and
interests.


31 Roger T. Lennon, "A Comparison of Results of Three Intelligence Tests,"
Test Service Notebook, No. 4 (New York: Harcourt, Brace & World, Inc., n.d.).
Copies available on request. The three tests studied were The Otis Quick-Scoring
Mental Ability Tests: Gamma Test; The Pintner General Ability Tests: Verbal
Series, Advanced Test; and The Terman-McNemar Test of Mental Ability. A
publication of The Bureau of Pupil Guidance, Chicago Public Schools, entitled
"Equivalence of Intelligence Quotients of Five Group Intelligence Tests," gives
IQ equivalents for five intelligence tests: The Lorge-Thorndike, Otis
Quick-Scoring, California Mental Maturity, Kuhlmann-Anderson, and Pintner. A
summary table of the results is given in Irving Lorge and Robert Thorndike,
Technical Manual, The Lorge-Thorndike Intelligence Tests (Boston: Houghton
Mifflin Company, 1962).

Achievement in exploratory courses (general shop, general music, and 
the like), work experience, extracurricular activities, and hobbies all re- 
flect the student's abilities and interests. However, caution must be used 
in making predictions regarding probable vocational success on the basis of 
students' interest and achievement in school activities. Participation in 
A Cappella Choir or Senior Band, for example, may not be predictive of
success in music vocations; a role in a school play may not indicate talent 
sufficient to succeed in the highly competitive field of dramatics; nor does 
the ability to sketch costumes necessarily qualify a student for a career in 
fashion illustration. 

Hobbies and extracurricular activities are especially valuable as indi- 
cators of interest in helping the student to choose among several occupa- 
tions, all of which have about the same requirements with respect to general 
or special mental abilities. They are less valuable, however, as indicators 
of aptitude, and may even be misleading for students desiring to enter the 
professions, which require rigorous preservice training, or vocations in the 
fine arts, which are highly competitive. Nevertheless, for all students,
self-evaluation of try-out experiences in courses and activities helps to
provide a more adequate basis for interpreting the more objective test data.


REPORTING APTITUDE-TEST RESULTS TO STUDENTS Although this sub- 
ject is considered at greater length in Chapter 17, it is appropriate at this 
point to emphasize that (1) a student's profile of aptitude test results can- 
not be drawn unless the tests to be compared are normed on the same 
population or on comparable norming groups; (2) the profile should be 
drawn on a scale that reflects the nature of percentile norms, as in Figure 
6.2; (3) the aptitude test data should be summarized together with achieve- 
ment test results, a record of marks by subject field, interest-inventory 
findings, and other relevant data; and (4) the presentation of findings 
should be handled in such a way as to promote maximum student interest 
and growth in self-appraisal and self-direction. 


THE FORECASTING OF SUCCESS AND FAILURE If a girl achieves high 
percentile ranks on the clerical speed and accuracy and the language usage 
subtests of the Differential Aptitude Tests, the counselor can justifiably 
inform her that she has a high probability of success in courses preparing 
for stenography. However, there are so many unmeasured factors that
affect vocational success that the degree of her success in a stenographic 
position cannot be predicted. If, on the other hand, the student has made 
low scores on the clerical-aptitude tests, and has had little success in re- 
lated school experiences, the counselor can indicate with some assurance 
that she has little likelihood of success in the prerequisite training program 
or on the job. Similarly, low scores on tests of music or art aptitude or 
manual dexterity, combined with supporting data from school or job ex- 
periences, would justify counseling students against entering vocations that 
require such abilities. High scores, however, would by no means be
sufficient to justify a prediction of a high degree of vocational success in
those occupations.

Aptitude tests can be valuable in indicating the fields in which a student 
is unlikely to succeed. There is great need for more research, however, on 
the minimal or critical score, which would indicate minimum aptitude in 
critical abilities for various occupations. Research studies by the United 
States Employment Service have indicated that unique patterns of abilities 
for specific vocations do not, in fact, exist. Rather, for each job there
appears to be only a set of minimal requirements. In other words, if an
individual has at least the minimal level of ability in the various component
characteristics required in a vocation, the probability of his success may
depend chiefly on personality and motivational factors. In the use of the
General Aptitude Test Battery (described in Table 6.4), critical, or
"cut-off," scores³² are provided, which constitute a major basis for
interpretation.
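
The logic of interpreting by minimal requirements, rather than by a pattern of high scores, can be sketched as a simple screening check. The aptitude names and cutoff values below are hypothetical and are not taken from the General Aptitude Test Battery; the function names are also chosen only for this illustration.

```python
def failed_requirements(scores: dict, cutoffs: dict) -> list:
    """Return the aptitudes on which a score is missing or below its cutoff."""
    return [
        aptitude
        for aptitude, minimum in cutoffs.items()
        if scores.get(aptitude) is None or scores[aptitude] < minimum
    ]


def meets_minimal_requirements(scores: dict, cutoffs: dict) -> bool:
    """True only if every required aptitude is at or above its cutoff."""
    return not failed_requirements(scores, cutoffs)


# Hypothetical cutoffs for an imaginary clerical occupation (not GATB values).
clerical_cutoffs = {"general": 90, "verbal": 95, "clerical_perception": 100}
applicant = {"general": 104, "verbal": 97, "clerical_perception": 92}

print(meets_minimal_requirements(applicant, clerical_cutoffs))
print(failed_requirements(applicant, clerical_cutoffs))
```

Note that a very high score on one aptitude does not compensate for falling below the cutoff on another. Passing every cutoff says only that the minimal requirements are met; it is not a forecast of the degree of success.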


THE NEED FOR MORE EXTENSIVE VALIDATION DATA The publishers of 
the Differential Aptitude Tests, the Flanagan Aptitude Classification Tests,
and other new aptitude test batteries are committed to research programs
involving (1) the norming of these tests for occupational groups and (2)
study of the relationship of aptitude test scores to educational and voca- 
tional success. The latest supplement to the DAT manual summarizes a 
large number of such research studies on this battery. 

In order to obtain maximal value from aptitude tests, the counselor 
should have at hand, for each aptitude test, percentile or standard-score
norms for many relevant occupational groups and for students in training 
for relevant occupations. The counselor should be able to tell the student 
how his scores compare with those of college or occupational groups with 
which he will have to compete. 

Data on the relationship of aptitude-test scores to students' marks in
various high school and college curricula are also significant. In fact, local
expectancy tables for predicting grades in college-preparatory and
vocational subjects are very useful in interpreting aptitude test scores to
students.³³


32 The 33d percentile point, arbitrarily taken as distinguishing the "less able"
from the "more able" workers.
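
A local expectancy table of the kind mentioned above can be assembled from a school's own records. The sketch below shows one possible way to tabulate it; the score bands, the sample records, and the function name are all hypothetical, and a real table would rest on far more cases.

```python
from collections import Counter, defaultdict


def expectancy_table(records, band_lower_bounds):
    """Percent of students in each score band who earned each grade.

    records: iterable of (aptitude_score, course_grade) pairs.
    band_lower_bounds: lower limits of the score bands.
    """
    counts = defaultdict(Counter)
    for score, grade in records:
        # Assign the score to the highest band whose lower limit it reaches.
        band = max((b for b in band_lower_bounds if score >= b), default=None)
        if band is not None:
            counts[band][grade] += 1

    table = {}
    for band, grade_counts in counts.items():
        total = sum(grade_counts.values())
        table[band] = {g: round(100 * n / total) for g, n in grade_counts.items()}
    return table


# Hypothetical local data: (aptitude test score, later course grade).
local_records = [
    (82, "A"), (75, "B"), (71, "B"),
    (64, "B"), (58, "C"), (55, "C"), (52, "D"),
    (40, "D"), (35, "F"), (33, "D"),
]

table = expectancy_table(local_records, band_lower_bounds=[0, 50, 70])
for band in sorted(table, reverse=True):
    print(f"Scores {band} and up: {table[band]}")
```

Read across a row, such a table lets the counselor say, for example, what percent of past students with scores in the student's own band went on to earn each grade.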


APTITUDE TEST RESULTS AND DIFFERENTIAL PREDICTION The chief 
concern of the student and his parents is with differential prediction: "Will
I do better in law than in architecture?" "Am I better fitted for teaching
than for nursing?" Aptitude test scores do not provide direct answers to 
such questions. The counselor, however, can assist the student in sum- 
marizing and interpreting the available data. Techniques for differential 
prediction are presented in Chapter 17. 

Bennett and Doppelt point out that if the differences between student 
results on two tests are to be significant, the intercorrelation of the tests 
should not exceed .60, with lower values desirable. They have developed 
a nomograph by which one can use the reliability coefficients and inter- 
correlations of pairs of tests to determine the probability of finding reliable 
differences between pairs of student scores.³⁴
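
The reason a high intercorrelation undermines differential prediction can be seen from the classical formula for the reliability of a difference between two scores expressed on the same standard-score scale. The sketch below uses that textbook formula as a stand-in; it is not the Bennett and Doppelt nomograph itself, and the reliability values are assumed for illustration.

```python
import math


def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y for two tests on a common scale.

    r_xx, r_yy: reliability coefficients of the two tests.
    r_xy: intercorrelation between the two tests.
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


def standard_error_of_difference(sd: float, r_xx: float, r_yy: float) -> float:
    """Standard error of the difference between two scores with a common SD."""
    return sd * math.sqrt(2 - r_xx - r_yy)


# Two tests, each with an assumed reliability of .90.
for r_xy in (0.40, 0.60, 0.80):
    r_diff = difference_reliability(0.90, 0.90, r_xy)
    print(f"intercorrelation {r_xy:.2f}: reliability of difference = {r_diff:.2f}")
```

With both reliabilities at .90, the reliability of the difference is .75 when the tests correlate .60 but only .50 when they correlate .80, which is why low intercorrelations are wanted in a battery used for differential prediction.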

The counselor needs aptitude tests that have the greatest promise for 
differential prediction because they (1) are the most reliable tests that can 
be obtained; (2) are accompanied by a wealth of validation data that help 
the student to compare himself with various groups; (3) show, by a pattern 
of low intercorrelations, that they measure abilities that are not closely 
interrelated; and (4) provide data that help guidance workers to interpret 
the statistical significance of differences between scores. 


33 For suggestions on the summarization of local data, see "Expectancy Tables:
a Way of Interpreting Test Validity," Test Service Bulletin No. 38 (New York:
The Psychological Corporation, 1949). Illustrative expectancy tables are given in
Chapters 4 and 17.

34 George K. Bennett and Jerome E. Doppelt, "Evaluation of Pairs of Tests for
Guidance Use," Educational and Psychological Measurement, vol. 8 (Autumn
1948), pp. 319-323.


SUMMARY STATEMENT


Aptitude tests are designed to measure the individual's ability to progress in
learning activities of some specified type. The validity of an aptitude test is
judged in terms of the value of its scores in predicting future performance.

Some aptitude tests are specially designed to measure aptitude for some
subject or vocation, that is, to measure a combination of abilities related to
future success in that subject or vocation. Other tests are designed to measure
the individual's performance level in one or more discrete, unitary abilities,
such as verbal comprehension or perceptual speed, which affect an individual's
performance in many different subjects or occupations. In preparing aptitude
tests of either type, authors attempt to include tasks which are (1) equally
unfamiliar to all examinees or (2) equally familiar (in the sense that all
examinees have had approximately equal opportunity to learn them).

Although no attempt was made to describe all the leading aptitude tests, illus- 
trations of all the major types were given. Comparisons were made in the text 
and in summary tables that were designed to help students understand similari- 
ties and differences between: (a) an aptitude test battery and a multiscore test 
of mental abilities, (b) two leading individual intelligence tests, (c) two leading 
vocational aptitude tests, and (d) a performance and paper-and-pencil test of 
the same special aptitude (spatial visualization). 

Instead of routinely administering tests of general intelligence at several 
grade levels, school authorities might well choose from a variety of aptitude 
tests those that provide the most valuable information in making the types of 
decisions needed at each level. 

The results of scholastic aptitude tests have usually been interpreted in terms
of MA's and IQ's. The individual's mental age can be interpreted in terms of
his level of mental maturity or his readiness to undertake learning tasks of a
certain level of complexity. The IQ indicates the student's rate of mental
development and is useful in predicting his rate of progress in learning activities
appropriate to his mental maturity level. The limitations of age scores and of
the ratio method of computing IQ's were considered.

Teachers are rightfully concerned about the range in the IQ's obtained for an
individual during his school years. Variability in intelligence-test results can be
reduced, in part, by a systematic program of administering intelligence tests at
least biennially, careful selection of tests, desirable conditions of administration,
and retesting of certain students. Teachers should recognize the need for
supplementing verbal tests by nonverbal tests when there is a reading handicap,
and of using individual tests when there is reasonable doubt as to the ability or
willingness of a student to do his best work on a group test.

In studying intra-individual differences in aptitudes, it is essential that tests
of the different aptitudes be normed on the same or comparable norming
samples. Although the use of such tests in counseling is considered in Chapter 17,
several principles of interpretation were emphasized in this chapter: (1) Other
evidence, such as achievement test data, students' marks, and the like, should
be considered. (2) Caution should be exercised in reporting aptitude test results
to students. (3) Aptitude tests forecast failure more reliably than success.
(4) More extensive validation data are needed for use in test interpretation.
(5) Aptitude test results are frequently inadequate for differential prediction.


SELECTED REFERENCES 


Cultural Factors Affecting Aptitude Test Scores 


ANASTASI, ANNE, "Some Implications of Cultural Factors for Test Construction,” 
Proceedings of the 1949 Invitational Conference on Testing Problems. 
Princeton, N.J.: Educational Testing Service, 1950, pp. 13-17. 

COLEMAN, WILLIAM, AND ANNIE W. WARD, “A Comparison of Davis-Eells and 
Kuhlmann-Finch Scores of Children from High and Low Socio-Economic 
Status," Journal of Educational Psychology, vol. 46 (December 1955),
pp. 463-469.




DARCY, NATALIE T., "A Review of the Literature on the Effects of Bilingualism 
upon the Measurement of Intelligence," Pedagogical Seminary and Journal 
of Genetic Psychology, vol. 82 (March 1953), pp. 21-57.

EELLS, KENNETH, AND OTHERS, Intelligence and Cultural Differences. Chicago: 
University of Chicago Press, 1951.

JONES, W. R., “A Critical Study of Bilingualism and Non-Verbal Intelligence," 
British Journal of Educational Psychology, vol. 30 (February 1960), pp. 
71-77. 

LEVINSON, BORIS M., “Subcultural Variations in Verbal and Performance Ability 
at the Elementary School Level,” Journal of Genetic Psychology, vol. 97 
(September 1960), pp. 149-160. 

NOLL, VICTOR H., “Relation of Scores on Davis-Eells Games to Socio-Economic 
Status, Intelligence Test Results, and School Achievement,” Educational 
and Psychological Measurement, vol. 20 (Spring 1960), pp. 119-129. 


Other Factors Affecting Validity of Aptitude Test Scores 


FRANKEL, EDWARD, “Effects of Growth, Practice, and Coaching on Scholastic 
Aptitude Test Scores,” Personnel and Guidance Journal, vol. 38 (May 
1960), pp. 713-719. 

FRENCH, JOHN W., AND ROBERT E. DEAR, “Effect of Coaching on an Aptitude 
Test,” Educational and Psychological Measurement, vol. 19 (Autumn 
1959), pp. 319-330. 

HOLLOWAY, H. D., “Effects of Training on the SRA Primary Mental Abilities 
(Primary) and WISC,” Child Development, vol. 25 (December 1954), 
pp. 253-263. 

MAURER, KATHERINE M., Intellectual Status at Maturity as a Criterion for Se- 
lecting Items in Preschool Tests. Minneapolis, Minn.: University of Minne- 
sota Press, 1946. 

SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness by
Means of Psychological Tests. New York: Harper & Row, Publishers, Inc., 
1962, Chapters 5, 6, 10-14. 


Stability of Aptitude Test Scores 


BAYLEY, NANCY, "Data on the Growth of Intelligence between 16 and 21 Years 
as Measured by the Wechsler-Bellevue Scale," Journal of Genetic
Psychology, vol. 90 (March 1957), pp. 3-15.

BRADWAY, KATHERINE P., CLARE W. THOMPSON, AND R. B. CRAVENS, “Preschool 
IQ's After Twenty-five Years," Journal of Educational Psychology, vol. 
49 (October 1958), pp. 278-281.

MEYER, WILLIAM J., "The Stability of Patterns of Primary Mental Abilities 
among Junior High and Senior High School Students," Educational and 
Psychological Measurement, vol. 20 (Winter 1960), pp. 795-800. 

TYLER, LEONA E., "The Stability of Patterns of Primary Mental Abilities among 
Grade School Children," Educational and Psychological Measurement, vol. 
18 (Winter 1958), pp. 769-774. 


Uses of Aptitude Tests 


DAVIS, FREDERICK B., Utilizing Human Talent (Washington, D.C.: American 
Council on Education, 1947). 




GHISELLI, E. E., Measurement of Occupational Aptitude (Berkeley, Calif.: Uni- 
versity of California Press, 1955). 

Identification and Guidance of Able Students (Washington, D.C.: American 
Association for the Advancement of Science, 1958). 

STROUD, J. B., "The Intelligence Test in School Use: Some Persistent Issues," 
Journal of Educational Psychology, vol. 48 (February 1957), pp. 77-85. 

SUPER, DONALD E., ed., The Use of Multifactor Tests in Guidance (Washington, 
D.C.: American Personnel and Guidance Association, 1958). Reprinted from
Personnel and Guidance Journal, vol. 35 (September 1956), pp. 9-51. 


Prognostic Testing in Subject Areas 


BANHAM, KATHARINE M., “Maturity Level for Reading Readiness—A Check List 
for the Use of Teachers and Parents as a Supplement to Reading Readiness 
Tests," Educational and Psychological Measurement, vol. 18 (Summer 
1958), pp. 371-375.

CARROLL, JOHN B., “A Factor Analysis of Two Foreign Language Aptitude 
Batteries," Journal of General Psychology, vol. 59 (July 1958), pp. 3-19. 

FARNSWORTH, PAUL R., Musical Taste: Its Measurement and Cultural Nature, 
Stanford University Publications in Education-Psychology, vol. II, No. 1. 
(Stanford, Calif.: Stanford University Press, 1950), Chapters 3-4.

HORN, CHARLES A., AND LEO F. SMITH, "The Horn Art Aptitude Inventory,"
Journal of Applied Psychology, vol. 29 (October 1945), pp. 350-355. 

SALOMAN, ELLEN, "A Generation of Prognosis Testing," Modern Language
Journal, vol. 38 (October 1954), pp. 299-303.

WING, H., “Tests of Musical Ability and Appreciation: An Investigation into the 
Measurement, Distribution and Development of Musical Capacity," British
Journal of Psychology, Monograph Supplement 27, 1948. (Chicago: 
University of Chicago Press, 1949). 


General and Theoretical References 


BURT, CYRIL, "The Differentiation of Intellectual Ability," British Journal of
Educational Psychology, vol. 24 (June 1954), pp. 76-90.

DYER, HENRY S., "A Psychometrician Views Human Ability," Teachers College
Record, vol. 61 (April 1960), pp. 394-403.

GUILFORD, J. P., "Three Faces of Intellect," American Psychologist, vol. 14
(August 1959), pp. 469-479.

HUMPHREYS, LLOYD G., AND PAUL L. BOYNTON, "Intelligence and Intelligence
Tests," Encyclopedia of Educational Research. New York: The Macmillan
Company, 1950, pp. 600-612.

KETTNER, NORMAN W., J. P. GUILFORD, AND PAUL R. CHRISTENSEN, "A Factor-
Analytic Study across the Domains of Reasoning, Creativity and Evaluation,"
Psychological Monographs, vol. 73, No. 9, Whole No. 479 (Washington,
D.C.: American Psychological Association, 1959).

LEVINE, ABRAHAM S., "Aptitude Versus Achievement Tests as Predictors of
Achievement," Educational and Psychological Measurement, vol. 18 (Autumn
1958), pp. 517-525.

VERNON, P. E., The Structure of Human Abilities (New York: John Wiley and
Sons, Inc., 1950).




References re Specific Aptitude Tests 


BENNETT, G. K., AND OTHERS, Counseling from Profiles: A Casebook for the 
Differential Aptitude Tests (New York: The Psychological Corporation, 
1953). 

BURKE, HENRY R., "Raven's Progressive Matrices: A Review and Critical Eval- 
uation," Journal of Genetic Psychology, vol. 93 (December 1958), pp. 
199—228. 

GWYNNE-JONES, H., "The Evaluation of the Significance of Differences between 
Scaled Scores on the WAIS: Perpetuation of a Fallacy," Journal of Con- 
sulting Psychology, vol. 20 (August 1956), pp. 319-320. 

HARRIS, DALE B., Measuring the Psychological Maturity of Children: A Revision 
and Extension of the Goodenough Draw-A-Man Test. (New York: Har- 
court, Brace & World, Inc., 1961). 

JONES, LYLE V., "Primary Abilities in the Stanford-Binet, Age 13," Journal of 
Genetic Psychology, vol. 84 (March 1954), pp. 125-147. 

LITTELL, WILLIAM M., “The Wechsler Intelligence Scale for Children: Review 
of a Decade of Research," Psychological Bulletin, vol. 57 (March 1960), 
pp. 132-156. 

MCNEMAR, QUINN, "On WAIS Difference Scores," Journal of Consulting Psy- 
chology, vol. 21 (June 1957), pp. 239-240. 

MEYER, WILLIAM H., AND A. W. BENDIG, "A Longitudinal Study of the Primary 
Mental Abilities Test," Journal of Educational Psychology, vol. 52 (Feb- 
ruary 1961), pp. 50-60. 

SCHUTZ, RICHARD E., “Factorial Validity of the Holzinger-Crowder Uni-Factor 
Tests,” Educational and Psychological Measurement, vol. 18 (Winter 
1958), pp. 873-875. 

SHARP, H. C., AND L. M. PICKETT, “The General Aptitude Test Battery as a Pre- 
dictor of College Success," Educational and Psychological Measurement, 
vol. 19 (Winter 1959), pp. 617-623. 

TERMAN, LEWIS M., AND MAUD A. MERRILL, Measuring Intelligence. Boston: 
Houghton Mifflin Company, 1959. 

United States Department of Labor, Bureau of Employment Security, Guide to 
the Use of General Aptitude Test Battery: Section III. Development. 
(Washington, D.C.: Government Printing Office, 1958). 

WECHSLER, D., The Measurement and Appraisal of Adult Intelligence, 4th ed. 
(Baltimore: The Williams and Wilkins Company, 1958). 


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Summarize early developments in the mental testing of individuals. What
contributions were made by Binet? By Terman? When were group intelligence 
tests developed and what factors led to their development? 

2. How does the basic approach to the construction of intelligence tests differ 
from the basic approach used in constructing achievement tests? 

3. Study the reviews of one aptitude test from one of the Mental Measure- 
ments Yearbooks. Summarize the evidence given concerning the concurrent and 
predictive validity of that test.

4. Examine the manuals for two leading aptitude test batteries. Summarize 
the research data presented on the construction and validation of each test. 




5. Compare subtests from two aptitude test batteries for a single aptitude, 
such as spatial relations or verbal comprehension. Do the tests involve essentially
the same task for the student? How do they compare with respect to working 
time, clarity of instructions, norms available, validation data, and the like? 

6. For what purposes are tests of single aptitudes used? What are the advan- 
tages and disadvantages of using two or more of these tests, in comparison with 
an aptitude test battery measuring the same abilities? 

7. Assume that you are a high school counselor. Prepare notes for a talk to 
teachers on the interpretation of results on aptitude test batteries. 

8. In many cases a child's language IQ is 10 or more points below his non- 
language IQ. What are the possible reasons for such large differences? 

9. What are the uses for individual intelligence tests in a school in which 
group intelligence tests are regularly administered? 

10. Why might a specific test of scholastic aptitude have low validity for 
some children? 

11. Distinguish between tests of manual dexterity and mechanical aptitude. 
Cite an example of a test of each type. How is each type used in vocational 
guidance? 

12. What is experiential maturity and why is it an important factor in reading
readiness in the first grade? What informal methods can a teacher use to obtain
evidence on a pupil’s readiness for reading?

13. In the following list of educational problems, indicate whether the IQ
or the MA should be given greater consideration:

a. Estimating readiness for reading in the first grade
b. Determining the optimum grade placement for a student entering elementary
school
c. Estimating the amount of practice needed to attain mastery of certain
arithmetic skills
d. Selecting the candidates for a special class for the gifted
e. Classifying students within a single grade level into ability groups.

14. List the criteria frequently used to validate an intelligence test, and
evaluate the validity and the reliability of each criterion.

15. What significant factors in musical aptitude are not measured by the
Seashore tests? What are the values and limitations of this battery as a predictor
of accomplishment in music?

16. Describe and evaluate one prognostic test in your major subject field.
Why is it important to have local validation data for such tests?

17. Prepare a summary statement concerning the advantages and disadvantages
of prognostic testing in foreign languages, mathematics, or some other
subject field.


The Measurement of
Interests and Attitudes 


In the remaining chapters of Part II, we turn our attention to tests of in- 
terests, attitudes, and personality traits, in which we attempt to measure 
the typical performance of individuals, rather than their maximum per- 
formance. With tests of this type, test results may vary with the examinee’s 
perception of the test situation. In some situations, the examinee may 
frankly report his typical attitudes or behaviors as he perceives them. On 
other occasions, he may modify his answers in an attempt to make a good 
impression. 


THE NATURE OF INTERESTS 


During the past decade the use of interest inventories in secondary schools 
has markedly increased. Many interest inventories can be self-administered 
and self-scored; interest-inventory results intrigue students (without threat- 
ening them, as aptitude-test results may do); and the interpretation of 
interest profiles appears to be a process that can be safely attempted in 
group guidance classes by guidance teachers and counselors who have had 
training for such work. The fact that an interest inventory merely cate- 
gorizes his own responses helps the student to interpret his scores as a 
mirror of his own reactions, rather than a mysterious dictum from some 
authority. 

A person’s interests are the product of interaction between (1) inherited 
bases of ability and temperament and (2) many environmental factors, 
notably the opportunities he has had for pursuing certain interests and the 
value placed on their development by persons whose approval he values. 
In some instances, a young child’s aptitudes result in his receiving satisfaction
and approval for his successes; in other cases, early failures result in
discouragement or early successes are met with indifference or disapproval. 


Approaches to the Identification of Interests 


Super and Crites¹ distinguish four major interpretations of the term “in-
terest," associated with four methods of obtaining data on student interests. 


1. Expressed interest—the verbal profession of interest in an activity or occu- 
pation. The student simply expresses a liking, or indicates his dislike, for a 
particular activity or vocation. The significance of such expressions of inter- 
est varies with the maturity and experience of the individual. In some cases, 
expressed interests represent temporary whims or fantasies. 

2. Manifest interest—as evidenced by participation in an activity or occupation. 
A boy who is a radio “ham,” a girl who is active in the dramatics club, a 
student who does sports reporting for the school paper are manifesting their 
interests through actual participation. Manifest interests tend to be more 
stable than expressed interests since they are based on actual experience. 

This approach to the identification of interests, however, has serious
limitations. Sometimes the student participating in an activity is more interested
in its concomitants or by-products than in the activity itself. For example,
the girl in the dramatics club may be more interested in the social
contacts and prestige to be derived from club membership than in dramatics
as an art or form of self-expression. She may later seek these same goals
through other activities.

The manifestation of interests may be limited by financial considerations
or other environmental factors. Hence, although manifest interests provide
clues to possible educational and vocational goals, the absence of a specific
interest may reflect only lack of environmental opportunity to develop that
interest.

3. Tested interest—as measured by objective tests of vocabulary or other information
rather than an inventory of reported interests. The use of such tests
as the Michigan Vocabulary Profile Test as a measure of interest is based on
the assumption that a stable interest results in an accumulation of relevant
information and a corresponding growth in specialized vocabulary.

4. Inventoried interest—as measured by lists of activities or occupations to
which the student responds by an expression of liking or preference. Such
inventories superficially resemble questionnaires on expressed interests. However,
many activities that have an indirect relationship to vocational choice
are included. In answering the inventory items, the examinee records a series
of self-perceptions that are summarized in such a way as to reveal their
similarity to those of workers in different occupations (or to students of
his sex and grade level). The scores of each student can be interpreted
(by means of norms) as reflecting a pattern of relatively high or low
interests in various fields.


1 Donald E. Super and John O. Crites, Appraising Vocational Fitness by Means
of Psychological Tests (New York: Harper & Row, Publishers, Inc., 1962), pp.
—379. 




Almost all of the interest measures that have been published and stand- 
ardized for school use are of the fourth type. Experience and research 
suggest that interest inventories can be valuable aids in vocational guid- 
ance. The expressed interests of adolescents may be based on glamorized 
stereotypes of occupations, rather than an awareness of the specific activi- 
ties involved. Evidence from the first three sources, however, is useful in 
studying the validity of published inventories and in supplementing inven- 
tory results in the counseling of individual students. 

When a student's results on an interest inventory differ from his ex- 
pressed interests, the counselor should not assume that the interest inven- 
tory is more valid. Instead, the counselor should investigate whether the 
expressed interest is a longstanding or temporary one, and whether it is 
based on actual experience and mature consideration. The interest inventory
does have the advantage of obtaining the student's reactions to a large
sampling of items and of providing, through the use of converted scores, a
means of comparing the student's interests with those of others of his sex and age.
Berdie stresses the importance of considering both expressed and inventoried
interests: "As long as measured [inventoried] interests have a relevancy
for vocational satisfaction and as long as self-estimated [expressed]
interests play an important role in the deliberations of individuals, both
types of interests must be considered."²


The Relationship of Interests and Aptitudes 


It is generally accepted that interests and abilities are related—that is, 
a person tends to develop interests in activities that he performs easily 
and well and to shun those that are more difficult for him. The relationship,
however, is by no means simple and clear-cut. Adkins and Kuder
found only one correlation above .30 when they correlated scores on the
Chicago Primary Mental Abilities Test and the Kuder Preference Record.³
In Table 7.1, which summarizes data on the interrelationships of scores
on the Differential Aptitude Tests and the Kuder Preference Record, very
few appropriate pairings of aptitude and interest scores showed a signifi- 
cant relationship. In fact, when the results for the three grade levels (pre- 
sented in the manual) are studied, the only appropriate relationships that 
were consistently significant were between boys' scores in DAT mechanical 


2 R. F. Berdie, "Scores on the Strong Vocational Interest Blank and the Kuder
Preference Record in Relation to Self-Ratings," Journal of Applied Psychology, vol.
34 (February 1950), pp. 42-49.

3 D. C. Adkins and G. Frederic Kuder, "The Relation between Primary Mental
Abilities and Activity Preferences," Psychometrika, vol. 5 (December 1940), pp.
251-262.




reasoning and their Kuder interest scores in the mechanical and scientific 
areas. As Wesman warns, 


Experienced counselors may not need the reminder these data contain; to less 
experienced counselors, the results may well serve as a warning not to base 
counseling on interest scores without positive information with respect to the 
appropriate aptitudes and abilities of the student.⁴


Table 7.1 
Coefficients of Correlation between Subtest Scores of the Differential 
Aptitude Tests and Scales of the Kuder Preference Record 




Kuder Scales Coefficients of correlationᵃ with DAT tests

VR NA AR SR MR CSA SPELL. SENT. 
Mechanical 03 12 14 к; 40* 09  —04 —06 
Computational 16 22 28*  .19 20  .06 —.05 00 
Scientific 46% — 27* 36* —.13 32* 19 28* —27* 
Persuasive —18 -—12 22 -—12 -—05 -—12 -—14 —.17 
Artistic —19  —25* —17 .11 —03  —416 35* —.12 
Literary о оо оз —1 -—2 04  28* 12 
Musical 3 05 —06 22 -—001 15 44 3X 
Social Service —18 -—06 —10 -—11 —15 —13 106  —16 
Clerical —20 —04  —13 10 18 JË 17 —20 




Source: Data presented for a group of 65 10th-grade boys in an Ames, Iowa, high school.
Corresponding data for grades 11 and 12, as well as corresponding data for girls, are given
in George K. Bennett, Harold G. Seashore, and Alexander G. Wesman, Manual for the Differ-
ential Aptitude Tests, 3d ed. (New York: The Psychological Corp., 1959), p. 75.

a Only those coefficients followed by an asterisk are sufficiently high that the chances are
95 out of 100 that the “true r” between these tests is greater than 0.0. For the full names of
DAT tests, the reader is referred to Figure 6.2.


Perhaps the most helpful statement for the counselor is that interest affects
the direction of effort; ability, the level of achievement.⁵
We cannot infer from the low r's in Table 7.1, and those from similar
studies, that there is little relationship between interests and aptitudes. The
relationship may be an exceedingly complex one. Our common sense tells 


4 Alexander G. Wesman, "The Differential Aptitude Tests," Personnel and Guidance
Journal, vol. 31 (December 1952), p. 169.

5 Super and Crites, op. cit., p. 448.




us that interests are probably related to intraindividual differences in apti-
tude; and a research study by Segel⁶ found considerable support for this
hypothesis.

As measurement techniques and research procedures improve, higher
r's may be found. For example, Crites⁷ discovered that individuals with
average intelligence tended to score higher in "interest in technical occu- 
pations” than individuals of either above-average or below-average intelli- 
gence. The relationship between interest and aptitude scores in this study 
was a curvilinear one; that is, the relationship between students’ scores on 
aptitudes and their scores in related interest areas would be represented 
by a curve rather than a line. Thus the degree of relationship, or the 
predictability of one from the other, would be underestimated by Pearson r 
(which assumes a linear relationship between the variables studied). The 
possibility of curvilinear relationships between aptitudes and certain inter- 
est areas needs to be studied further; until such researches have been 
completed, however, one cannot generalize concerning the relationship 
between interests and aptitudes in specific fields. One can be quite sure, 


however, that inferences regarding level of aptitude cannot be made on the 
basis of interest inventory scores. 
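
The point about Pearson r and curvilinear relationships can be illustrated with a short sketch. The data below are invented for illustration only (they are not taken from the Crites study), and the correlation ratio (eta) is brought in here as one common index that does not assume linearity; the text itself does not name it.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation (assumes a linear relationship)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_ratio(xs, ys):
    """Eta: sqrt(between-group SS / total SS), grouping y by each x level."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    grand_mean = sum(ys) / len(ys)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    return math.sqrt(ss_between / ss_total)

# Hypothetical data: five aptitude levels (1 = low, 5 = high), two students
# at each level; interest scores peak at the middle level (inverted U).
aptitude = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
interest = [40, 44, 52, 56, 62, 66, 53, 55, 41, 43]

r = pearson_r(aptitude, interest)            # close to zero for these data
eta = correlation_ratio(aptitude, interest)  # close to one for these data
print(f"Pearson r = {r:.2f}, correlation ratio = {eta:.2f}")
```

With data of this shape, a counselor looking only at r would conclude that aptitude tells nothing about interest, although interest is in fact highly predictable from aptitude level.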


The Relationship of Interests and Personality Traits 


The processes of interest development and personality formation are so 
complex that it is impossible to measure the relative contributions of per- 
sonality traits to the development of interests. There is considerable evi- 
dence, however, that personality factors do play a significant role in the 
development of vocational interests and the making of vocational choices. 
Neurotics are more likely than normals to be interested in “talent” and social
service occupations for which they lack the requisite aptitudes.⁸ Darley
found that persons in the business contact and social service fields tended
to be better adjusted socially than those in literary and technical occupations.⁹

An individual's occupational choice represents a social role through
which he seeks self-realization. According to Darley and Hagenah, aptitudes
help to determine the level at which an interest may be developed;
but an individual's personality needs may be major determinants of his
interests and his vocational choice.¹⁰


6 David Segel, "Differential Prediction of Scholastic Success," School and Society,
vol. 39 (January 20, 1934), pp. 91-96.

7 J. O. Crites, "Intelligence and Adjustment as Determinants of Vocational Interest
Patterning in Late Adolescence." Unpublished doctoral dissertation, Columbia Uni-
versity, 1957, cited in Super and Crites, op. cit., p. 172.

8 C. H. Patterson, "Interest Tests and the Emotionally Disturbed Client," Educa-
tional and Psychological Measurement, vol. 17 (Summer 1957), pp. 264-280.

9 J. G. Darley, Clinical Aspects and Interpretations of the Strong Vocational
Interest Blank (New York: The Psychological Corporation, 1941).


Stability of Vocational Interests 


Strong's extensive research has shown that, although meaningful results
can be obtained with able students of 14 and 15, use of his inventory with
students below age 17 is not generally recommended.¹¹ Strong found an
increase in stability of interests even during college years. When college
graduates were retested nine or ten years after the first administration of
the Strong interest inventory, the average retest correlation was .56 for
those first tested as freshmen and .71 for those first tested as seniors.¹² The
fact that Taylor found an average test-retest correlation of .52 for 11th-grade
students retested six years later indicates fairly satisfactory stability
of interest scores for senior high school students.¹³

The relative instability of vocational interests in the earlier years does
not rule out the use of an inventory in the ninth or tenth grade but does
suggest that greater caution should be exercised in interpreting the results.
In one research study the Kuder Preference Record was administered in
the 9th grade and again in the 12th grade.¹⁴ For only three-fourths of the
students did their two highest interests in grade 9 remain among their three
highest in grade 12. Similarly, for only three-fourths of the students did
their lowest 9th-grade interest area remain among their three lowest in
grade 12. The problem of reading difficulty also must be considered in selecting
an interest inventory for use with younger students.¹⁵


TYPES OF INTEREST INVENTORIES 


Test authors have used a variety of approaches in the development of 
interest inventories. Two of these approaches (illustrated by the Strong
and Kuder inventories) are described in this chapter section. 


10 J. G. Darley and Theda Hagenah, Vocational Interest Measurement (Minne-
apolis, Minn.: University of Minnesota Press), pp. 190-193.

11 Ibid., pp. 419-426.

12 E. K. Strong, Jr., Vocational Interests of Men and Women (Stanford, Calif.:
Stanford University Press, 1943), p. 363.

13 K. Taylor, "Reliability and Permanence of Vocational Interests of Adolescents,"
Journal of Experimental Education, vol. 10 (September 1942), pp. 81-87.

14 George G. Mallinson and William Crumbine, "An Investigation of the Sta-
bility of Interests of High School Students," Journal of Educational Research, vol.
45 (January 1952), pp. 369-383.

15 B. Stefflre, "The Reading Difficulty of Interest Inventories," Occupations, vol.
26 (June 1947), pp. 95-96.




These two approaches to the measurement of interest are similar to the 
two major interpretations of "aptitudes" and the corresponding approaches 
to aptitude testing, discussed in the preceding chapter. The student will 
recall that some aptitude tests, for example, the prognostic tests, measured 
a combination of abilities that represented aptitude for a subject field. 
These tests tended to be quite heterogeneous with respect to the factors 
of mental ability measured. Another approach to aptitude testing was to 
construct homogeneous tests of specialized aptitudes, designed to measure 
discrete, unitary abilities. 

In a sense, the original work on the Strong inventories is analogous to 
the first approach used in aptitude testing; Strong developed his interest 
scales on the basis of empirical data for individual items, selecting his items 
on the basis of their predictive validity for the criterion of occupational 

membership and making no attempt to develop homogeneous scales of 
unitary traits or constructs. Kuder’s approach to interest measurement 
represents the second or unitary-trait approach. Interest dimensions were 
interpreted as unitary traits, and every attempt was made to increase the 
homogeneity of each interest scale, as is appropriate in measures of traits 
or constructs. 


An Inventory Based on Empirical Study of Interests 


Strong, who developed the first interest inventory based on extensive 
research, approached the problem by studying the ways in which the inter- 
ests of persons successfully employed in selected occupations differed from 
those of men of similar age, selected at random from occupations usually 
entered by college-educated men. For each item, the percentage of men in
a given occupation choosing each response was compared with the corresponding
percentage for “men-in-general.”
The following example illustrates the basis for assigning weights to item 1 
“actor” on the interest scale for Engineers. 


                        PERCENT   PERCENT       PERCENT
GROUP                   LIKE      INDIFFERENT   DISLIKE
Engineers                  9         31            60
Men (general group)       21         32            47
Difference               -12         -1            13
Weight assigned           -1          0            +1


On the basis of these empirical data, a “like” response to the vocation of 
actor is scored —1 on the engineer interest scale, while the response of dis- 
like to this item is scored +1. An examinee's score on the engineer scale, 
for example, would be the sum of the weights assigned to his responses. 
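
The arithmetic behind this example can be sketched in a few lines. The percentages are those given for the item "actor"; the 10-point cutoff used to turn a difference into a unit weight is a hypothetical rule chosen only so that the sketch reproduces the weights shown, and is not Strong's actual procedure.

```python
def response_differences(occupation_pct, general_pct):
    """Percentage-point difference (occupation minus men-in-general) per response."""
    return {resp: occupation_pct[resp] - general_pct[resp] for resp in occupation_pct}

def unit_weight(difference, cutoff=10):
    """Map a difference to -1, 0, or +1. The cutoff is an illustrative assumption."""
    if difference >= cutoff:
        return 1
    if difference <= -cutoff:
        return -1
    return 0

def scale_score(responses, item_weights):
    """Sum the weights attached to an examinee's responses on one occupational scale."""
    return sum(item_weights[item][resp] for item, resp in responses.items())

# Percentages for the item "actor" as reported in the example.
engineers = {"like": 9, "indifferent": 31, "dislike": 60}
men_in_general = {"like": 21, "indifferent": 32, "dislike": 47}

differences = response_differences(engineers, men_in_general)
actor_weights = {resp: unit_weight(d) for resp, d in differences.items()}

# One-item "engineer scale" for an examinee who dislikes the vocation of actor.
engineer_scale = {"actor": actor_weights}
score = scale_score({"actor": "dislike"}, engineer_scale)
print(differences, actor_weights, score)
```

A full scale would hold one such weight table per inventory item, and the examinee's score would be the sum over all items answered.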

On the basis of his research, Strong assigned numerical weights to responses
of “like,” “indifferent,” or “dislike” to each of the inventory items
for each of his occupational scales. In each case the weights were deter-
mined by comparing the responses of men in a given occupational group
(for example, actor, aviator, sales manager) with those of the combined
groups of “men in general.” For example, the greater the difference be-
tween the percentage of engineers who liked, disliked, or were indifferent
to each item and the percentage of "men in general" who had the same
response, the larger the weighting given that response as an indication of
interest in the vocation of engineer. The weights in the revised inventory
range from +4 to -4. A student who receives a high, or A, rating on the
engineer scale of Strong's inventory has revealed interests similar to those
reported by engineers in the norming sample. The letter ratings are as-
signed so that the top 69 percent of successful workers in the occupation
(for example, engineers) would receive A, and the lowest 2 percent would
receive C.¹⁶
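
The 69 percent figure can be checked against the T-score conversion described in the accompanying footnote (mean of 50, SD of 10, A for T-scores of 45 or over). The sketch below assumes a normal distribution of T-scores in the criterion group and treats the letter bands as contiguous, with lower cutoffs at 45, 40, 35, 30, and 25; that last simplification is an assumption made for the sketch, since the footnote states the bands in whole-number T-scores.

```python
from statistics import NormalDist

# T-scores in the occupational criterion group: mean 50, SD 10 (per the footnote).
t_scores = NormalDist(mu=50, sigma=10)

# Lower cutoff for each letter rating, highest first (contiguous bands assumed).
CUTOFFS = [("A", 45), ("B+", 40), ("B", 35), ("B-", 30), ("C+", 25)]

def letter_rating(t):
    """Translate a T-score into a letter rating."""
    for letter, lower in CUTOFFS:
        if t >= lower:
            return letter
    return "C"

def expected_shares():
    """Proportion of a normally distributed criterion group in each rating band."""
    shares = {}
    upper_cdf = 1.0
    for letter, lower in CUTOFFS:
        lower_cdf = t_scores.cdf(lower)
        shares[letter] = upper_cdf - lower_cdf
        upper_cdf = lower_cdf
    shares["C"] = upper_cdf
    return shares

shares = expected_shares()
for letter, share in shares.items():
    print(f"{letter:>2}: {share:.1%}")
```

Under these assumptions the A band holds roughly 69 percent of the criterion group, in agreement with the figure quoted for the letter ratings.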

The items in Strong's inventories include lists of (1) occupations at and 
above the skilled level, (2) school subjects, (3) amusements, (4) recrea- 
tional and vocational activities and club offices, (5) well-known persons 
exemplifying occupational stereotypes or personality attributes, (6) factors
affecting vocational satisfaction, and (7) self-rating questions on the pres- 
ent abilities and characteristics of the respondent. Different forms are 
provided for men and women. 

Keys for at least 47 occupations are available for the SVIB for men,
while the form for women can be scored for at least 27 occupations. The
laborious, expensive scoring of the Strong inventories has constituted a
major drawback to their use at the secondary school level. Research has
demonstrated that narrowing the range of weights so that all responses
are weighted as +1, 0, or -1 would result in different counseling in one
twelfth to one sixth of the cases.¹⁷ Hence, Strong does not recommend
such a simplification, contending that the higher validity justifies the scoring
cost of approximately one dollar per student.¹⁸

As a result of factor-analysis studies of the vocational scales, several 


16 Raw scores on the SVIB for each occupational criterion group were changed
to T-scores having a mean of 50 and an SD of 10. Then, the T-scores were trans-
lated into letter grades as follows: A, T-scores of 45 or over; B+, 40-44; B, 35-39;
B-, 30-34; C+, 25-29; C, below 24. The student can check the percentages that
would fall in each group in a normal distribution; for example, approximately 69
percent of the occupational criterion group would have T-scores of 45 or over,
corresponding to an A grade.

17 E. K. Strong, Jr., "Weighted vs. Unit Scales," Journal of Educational Psy-
chology, vol. 36 (April 1945), pp. 193-216.

18 The names and addresses of organizations offering scoring services are listed
in the test manual. Special answer sheets are required for scoring by the IBM
electrical test-scoring machine or by the Hankes method. New scoring methods will
undoubtedly reduce scoring costs.




group scales have been developed.¹⁹ These factor-analysis studies have
proved valuable in classifying and interpreting the ratings received in spe- 
cific occupations. When related occupations are grouped together (as they 
are on the report sheets supplied with the inventory), a type of pattern 
analysis is facilitated. If a student who has scored A on the physician 
scale (his expressed vocational choice) also has received A's or B+’s in
the related occupations of group I, the counselor has a better basis for 
encouraging his choice (from the point of view of interest and job satis- 
faction) than if his ratings in these related occupations were largely B's 
and C's. 

Darley makes a helpful distinction among primary, secondary, and 
"reject" patterns in the various occupational groups. He defines primary 
interest patterns as those occupational groups in which the letter ratings 
received by the student are predominantly A and B+; secondary patterns 
as those occupational groups in which B+ and B ratings predominate, and
reject patterns as those fields in which the students’ scores fall predomi- 
nantly in the “chance-score” zone (the gray area of the test profile). He 
contends that it is more helpful to know that a student has primary in- 
terests in the scientific and literary occupations with a secondary pattern 
in social welfare than to know that he made A’s as psychologist, physician, 
physicist, personnel director, and the like. 

McArthur and Stevens,²⁰ on the basis of their 14-year follow-up study
of college men, found that the scales for specific occupations had higher
predictive validity than did Darley’s pattern analysis. Pattern analysis,
however, is usually more easily understood by students; and it is indispensable
in considering vocational choices for which no occupational scale
exists. Darley and Hagenah²¹ warn against using only the group keys in
counseling students. It is quite possible that a student might have a high
score on an occupation within a group and yet not have a high score on
the group scale in which that occupation is classified.

Three nonoccupational scales for the SVIB have been in use for many 
years: the masculinity-femininity scale, an interest maturity scale, and an 
occupational level scale. The occupational level scale measures the simi- 
larity of one’s interests to those of men in higher- or lower-status occupa- 
tions. A specialization scale has been more recently developed, which 


19 These clusters are enumerated in the footnote to Table 7.2. Scores on these
scales do not measure degree of interest in a field but rather degree of similarity
between the examinee’s interests and those of successful workers in the occupation,
or group of occupations.

20 C. McArthur and Lucia B. Stevens, "The Validation of Expressed Interest as
Compared with Inventoried Interests: A Fourteen-Year Follow-Up," Journal of
Applied Psychology, vol. 39 (June 1955), pp. 184-189.

21 Darley and Hagenah, op. cit., p. 34.




measures the extent to which one's interests resemble those of people who 
specialize as compared with those who might be described as generalists 
(for example, general practitioners in medicine). 

Most of the occupational keys for the SVIB were based on the responses
of men employed in the 1920s and 1930s. The need for updating one occu-
pational key, that for psychologist, has been demonstrated. At the time
that the psychologist scale was developed in 1928, Fellows of the American
Psychological Association (who served as the research group) were largely
experimental psychologists working on laboratory studies. Although 82
percent of the early sampling made A scores, a study made twenty years
later revealed that only 52 percent of a representative group of psycholo-
gists made A scores. In fact, Kreidt²² revised the psychologist scale in
terms of his findings and developed scales for specialties within psychology.
A recent synthesis of research findings on this subject, however, revealed
that most of the occupational fields have not shown significant changes.²³

Research has shown that women’s occupational interests are not so well 
defined as are those of men. High correlations have been found between 
the housewife scale and scales for five other occupations. In fact, the 
"home-vs.-career decision" seems to overshadow other differences in vocational 
interest. 


An Inventory Measuring Interests as Unitary Traits 


The construction of the Kuder Preference Record, Vocational was based 
on the assumptions that (1) there are basic interest groups and (2) the 
student's pattern of interest can be best measured by the forced-choice 
method, which requires him to express his preferences among activities. 

The method of construction of the Kuder inventories was markedly dif- 
ferent from that used by Strong. The first step was the preparation of a 
large number of items that appeared to measure interest in activities in 
certain areas, such as literary or clerical. A lengthy preliminary edition of 
such items was administered to a large, unselected group of people and 
scored according to an a priori key, based on the author’s judgment. Each 
of the items was then studied for its ability to differentiate between persons 
who made high and low total scores on the interest scale. Thus, a set of 
items was obtained in each interest field that was homogeneous or had high 
internal consistency. 
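
The item-selection step just described can be illustrated with a short sketch. It is not Kuder's actual procedure: the response layout (1 for the keyed answer, 0 otherwise), the function names, the split fraction, the cutoff, and the sample data are all assumptions made here for illustration. An item is retained only if examinees with high total scores on the a priori key endorse it more often than examinees with low total scores.

```python
def discrimination_index(responses, item, fraction=0.27):
    """Endorsement rate among high scorers minus the rate among low scorers.

    responses: list of dicts mapping item name -> 1 (keyed response) or 0.
    fraction:  share of examinees placed in each extreme group (assumed value).
    """
    # Rank examinees by total score on the a priori key, highest first.
    ranked = sorted(responses, key=lambda r: sum(r.values()), reverse=True)
    n = max(1, int(len(ranked) * fraction))
    high, low = ranked[:n], ranked[-n:]
    p_high = sum(r[item] for r in high) / n
    p_low = sum(r[item] for r in low) / n
    return p_high - p_low


def select_items(responses, cutoff=0.30, fraction=0.27):
    """Keep the items whose discrimination index reaches the (assumed) cutoff."""
    return [item for item in responses[0]
            if discrimination_index(responses, item, fraction) >= cutoff]


# Hypothetical responses of four examinees to three trial "literary" items.
sample = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
]
kept = select_items(sample, cutoff=0.30, fraction=0.5)
```

In this made-up sample, items a and b separate the high and low scorers while item c does not, so only a and b would be kept in the homogeneous scale.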

22 P. H. Kreidt, “Vocational Interests of Psychologists,” Journal of Applied 
Psychology, vol. 33 (October 1949), pp. 482-488. 

23 W. L. Layton, ed., The Strong Vocational Interest Blank: Research and Uses 
(Minneapolis, Minn.: University of Minnesota Press, 1960). 

24 Super and Crites, op. cit., pp. 446-448. 


Each item of the Kuder Preference Record, Vocational, is in forced- 
choice form; that is, the student is asked to choose among three activities. 
For example, the student indicates which of the three following choices he 
likes best and which he likes least. 


a. Develop new varieties of flowers. 
b. Conduct advertising campaign for florists. 
c. Take telephone orders in a florist's shop. 


If the student chooses alternative a as the best liked, he receives credit 
toward his score in the scientific and artistic areas. If he prefers alternative 
b, his persuasive score is credited; if he chooses c, he is credited in the 
clerical area. 
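
The crediting rule for this triad can be written out as a small sketch. Only the most-liked credits stated above are modeled; how least-liked choices enter the score is not described here and is left out. The key layout and the names `FLORIST_KEY` and `score_most_liked` are hypothetical, not taken from the Kuder manual.

```python
from collections import Counter

# Areas credited when each alternative of the florist triad is chosen as
# "most liked", following the description in the text.
FLORIST_KEY = {
    "a": ("scientific", "artistic"),
    "b": ("persuasive",),
    "c": ("clerical",),
}


def score_most_liked(choices, keys):
    """Tally area credits for a sequence of most-liked choices.

    choices: the alternative picked on each item, e.g. ["a", "c"].
    keys:    one scoring key per item, in the same order as choices.
    """
    totals = Counter()
    for choice, key in zip(choices, keys):
        for area in key[choice]:
            totals[area] += 1
    return totals


one_item = score_most_liked(["a"], [FLORIST_KEY])
```

A student who marks alternative a as best liked is credited once in the scientific area and once in the artistic area, and in no other area.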

By means of a special answer pad, the inventories are easily self-scored 
by the students; or they can be machine scored. Scores are obtained in each 
of ten areas: outdoor, mechanical, computational, scientific, persuasive, 
artistic, literary, musical, social service, and clerical. These scores are 
plotted on a profile sheet for men or women students and are thereby con- 
verted into percentile ranks for those groups. Instead of comparing his 
responses with those of workers in various occupations, the student compares 
his scores in each interest area with those of “men students in general” 
or “women students in general.” 

Because of the type of item used (forced-choice or preference), Kuder 
scores reflect the student's relative interest in different fields. The “average 
score” for any student in all areas would approximate a PR of 50. The 
forced-choice type of item does not permit the enthusiastic student with 
many interests to show a higher average level of interest than does the 
apathetic student. 
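
A toy illustration may make this property of forced-choice scores concrete. It assumes, as a simplification not taken from the Kuder key, that every item awards exactly one credit to exactly one area; the three areas and both students are hypothetical.

```python
AREAS = ("outdoor", "mechanical", "clerical")


def area_scores(most_liked, areas=AREAS):
    """Count how often each area was chosen as most liked (one credit per item)."""
    return {area: most_liked.count(area) for area in areas}


# A student who spreads his choices evenly across areas, and one who
# concentrates on a single area, each answering six items.
spread_out = area_scores(["outdoor", "mechanical", "clerical"] * 2)
single_minded = area_scores(["outdoor"] * 6)

# Every item hands out the same amount of credit, so the totals must match.
total_spread = sum(spread_out.values())
total_single = sum(single_minded.values())
```

Both totals equal the number of items answered, so the two students have the same average score across areas. The profiles differ only in shape, which is why such scores show relative rather than absolute strength of interest.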

A student's profile is usually interpreted by noting his two highest inter- 
est scores and referring to a list of occupations in the manual for which 
this combination of scores is characteristic. Low-interest areas should also 
be taken into account, for the student might dislike occupations involving 
such activities. 

This type of interpretation rests on assumptions about the relationships 
between interest scores and satisfaction in logically related occupations. 
The latest manual provides supplementary data showing profiles of average 
scores for various occupational groups (144 men’s and 68 women’s occupations). 
Means and SD’s for each occupational group are given. In general, 
the findings support logical expectations. For example, musicians are 
high in music, chemists in science, and authors in literature; there were, 
however, a sufficient number of exceptions to demonstrate that the validity 
of logical inferences should be tested empirically. 




An interesting and promising device to aid in student self-appraisal and 
vocational guidance is the Kuder Preference Record, Personal, which summarizes 
students' preferences in such categories as: (1) preference for 
being active in groups, (2) preference for familiar and stable situations, 
(3) preference for working with ideas, (4) preference for avoiding conflict, 
and (5) preference for directing others. The same type of item (the 
forced-choice triad) is used. Since one is most interested in differences 
between scale scores, it is encouraging to find that the intercorrelations 
between scale scores are low. In a sense, this is a personality inventory with 
implications for vocational choice and job satisfaction that has the appearance 
of an interest inventory. A verification scale is used to identify students 
who answer carelessly or without understanding. 

The most recent inventory in the Kuder series is the Kuder Occupa- 
tional, Form D, published in 1956. This inventory closely resembles the 
Strong inventory in purpose and in techniques of construction. The one 
hundred forced-choice items in Form D were selected from the items in 
the Kuder vocational and personal inventories. This form is especially 
useful to companies that wish to develop keys for use in the placement of 
applicants in specific job situations. Until longitudinal studies provide pre- 
dictive validity data, it is impossible to judge whether this inventory will 
prove to be as valuable in counseling as the Strong. In this test, raw scores 
are translated into “differentiation ratios.” A positive DR for an occupational 
group means that a larger percentage of persons in that occupational 
group received the examinee’s score than did the base group of “men-in-general.” 
The higher the examinee’s DR in the positive direction, the 
greater the probability that his interests are similar to those of workers in 
that occupation. The manual includes preliminary data concerning the 
significance of these scores in vocational counseling. Although more data 
are needed, findings to date are promising. 


OTHER INTEREST INVENTORIES The Strong and Kuder inventories, just 
discussed, are the only ones on which extensive research data have been 
cumulated. The Guilford-Shneidman-Zimmerman Interest Survey has the 
distinction of having been based on factor analysis. Nine categories of in- 
terest have been identified, each of which has two subscores (for example, 
aesthetic appreciation vs. aesthetic expression). Further research studies on 
the relationship of student test scores to later criterion data on job success 
and job satisfaction would make this inventory of greater value to counselors. 

25 Ibid., pp. 552-560. 

Another vocational interest inventory frequently used in school counseling 
is the Occupational Interest Inventory by Lee and Thorpe. This inventory 
was devised by the methods outlined in Table 4.2, on content validity. 
The authors defined a universe of items, that is, the job descriptions in the 
Dictionary of Occupational Titles (a handbook prepared by the United 
States Employment Service). Within each of six areas (selected on logical 
bases rather than factor analysis), tasks were selected to represent low, 
medium, and high levels of responsibility. These tasks were presented in 
pairs as forced-choice items. 

The Occupational Interest Inventory yields scores for six fields of inter- 
est: personal-social, natural, mechanical, business, arts, and science. In 
addition, responses are rescored to obtain three type-of-interest scores— 
in verbal, manipulative, and computational activities. A level-of-interest 
score, revealing whether the student has tended to choose activities at the 
routine, skilled, or professional-supervisory levels, is also obtained; this 
score is helpful to the counselor in applying inventory results to problems 
of vocational choice. 

The OII differs from the Kuder Preference Record in several respects: 
(1) it offers the student pairs of items rather than three items for comparison; 
(2) all OII items are concerned exclusively with occupational 
activities; (3) the Kuder scales are homogeneous while those of the OII 
are much more heterogeneous and hence more difficult to interpret. Basing 
the items on the Dictionary of Occupational Titles may make the inventory 
more acceptable to adult users but may involve the inclusion of many 
activities with which students have little knowledge or experience. 

A study of the relationship of Occupational Interest Inventory scores 
with those from the Kuder Preference Record26 revealed the following 
correlations above .60: 

OCCUPATIONAL INTEREST INVENTORY    KUDER PREFERENCE RECORD    r 

Personal-social                    Social service             .627 
Mechanical                         Mechanical                 .757 
Business                           Clerical                   .627 
Sciences                           Scientific                 192 
Verbal                             Literary                   .696 
Verbal                             Mechanical                 -.701 

26 Edward C. Roeber, “The Relationship between Parts of the Kuder Preference 
Record and Parts of the Lee-Thorpe Occupational Interest Inventory,” Journal of 
Educational Research, vol. 42 (April 1949), p. 606. 

The correlation between the computational scores on the two inventories 
was only .544. Examination revealed that the computational items of the 
Occupational Interest Inventory were largely clerical, whereas the computational 
scale of the Kuder included activities involving higher mathematics. 
Such comparisons indicate the importance of the counselor's knowing the 
specific content of all tests for which he interprets scores. 
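
For readers who wish to see how a coefficient of this kind is computed, the following sketch gives the usual product-moment formula. The paired scores are invented for illustration and are not Roeber's data; the function name `pearson_r` is likewise an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of six students on two similarly named scales.
oii_mechanical = [12, 18, 25, 30, 34, 40]
kuder_mechanical = [20, 31, 38, 52, 49, 66]
r_mechanical = pearson_r(oii_mechanical, kuder_mechanical)
```

A coefficient near 1 means that the two scales rank students in nearly the same order. A negative coefficient, such as the one reported for the verbal and mechanical scales, means that high scores on one scale tend to go with low scores on the other.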


BASIC INTEREST GROUPS 


Although various approaches have been used in the construction of inter- 
est inventories, considerable agreement has developed on the basic interest 
groups to be considered in vocational guidance. Table 7.2 is a version of a 
table prepared by Super and Crites,27 modified to include the Occupational 
Interest Inventory and to exclude the Allport-Vernon28 and Lurie inventories, 
which are infrequently used in secondary schools. This table not 
only provides a summary of the interest groups measured by each of the 
leading inventories but also includes, in the last two columns, (1) the results 
of Guilford’s latest research in this area and (2) a synthesis by Super 
and Crites, who present their list of basic interest groups with the justification 
that “a cautiously named concept, cautiously used, is better than no 
concept at all.”29 


VALIDITY OF INTEREST INVENTORIES 


Prediction of Success in School Subjects 


Interest inventory scores have seldom shown correlations above .30 with 
grades or achievement test results in specific subject fields.30 One study 
indicated that interest scores did predict differences in achievement between 
courses; for example, engineer interests on the Strong Vocational 
Interest Blank correlated .61 with the difference between grades in mathematics 
and history.31 Although one must avoid generalizing from a single 
study, this finding is in accord with the common-sense assumption that 
interest would usually affect the channeling of effort, especially by noncompulsive 
students. 

27 Super and Crites, op. cit., p. 382. 

28 A revision of the well-known Allport-Vernon scale was published in 1951. 
Extensive research on this and the earlier edition is available. Its use is ordinarily 
limited to college students and adults. However, the inventory can appropriately 
be used with superior high-school students, for example, in elective courses in 
psychology (G. W. Allport, P. E. Vernon, and G. Lindzey, A Study of Values, rev. 
ed., New York: The Psychological Corporation, 1951). 

29 Super and Crites, op. cit., p. 383. 

30 Ibid., pp. 433-435. 

31 David Segel, “Differential Prediction of Scholastic Success," School and Society, 
vol. 39 (January 20, 1934), pp. 91-96. 


Table 7.2 
Basic Interest Groups as Defined in Five Interest Inventories, 
and as Proposed by Super and Critesᵇ 

THURSTONE  STRONG                 KUDER           O.I.I.            GUILFORDᵃ        SUPER AND CRITESᵇ 

Science    Science                Scientific      Science           Scientific       Scientific 
           (Groups I, II)ᶜ 

People     People                 Social service  Personal-Social   Social Welfare   Social Welfare 
           (Group V) 

Language   Language               Literary        (Verbal)ᵈ                          Literary 
           (Group X) 

           Things vs. people      Mechanical      Mechanical        Mechanical       Material (concrete) 
           (Group IV) 

Business   Business detail        Clerical;       Business;         Clerical         Systematic 
           (Groups VII and VIII)  Computational   Computationalᵈ                     (record-keeping) 

Business   Business contact       Persuasive      Business          Business         Contact (with people 
           (Group IX)                                                                for material gain) 

           Musician               Musical         Arts              Aesthetic        Aesthetic 
           (Group VI)                                               expression       expression 

                                  Artistic        Arts              Aesthetic        Aesthetic 
                                                                    interpretation   interpretation 

                                  Outdoor         Natural           Outdoor 

a J. P. Guilford, and others, "A Factor Analysis of Human Interests,” Psychological 
Monographs, No. 375 (Washington, D.C.: The American Psychological Association, 1954). 

b After a study of the factors appearing in the Thurstone, Allport-Vernon, Lurie, Strong, and 
Kuder tests, together with the literature upon which they are based, Super and Crites developed 
the list of factors appearing in the last column of this table. Donald E. Super and John O. 
Crites, Appraising Vocational Fitness by Means of Psychological Tests (New York: Harper & Row, 
Publishers, Inc., 1962), pp. 383-384. 

c A factor analysis by Strong of the Vocational Interest Blank for Men (revised) made at a 
time when only 36 occupational scales were available led to the formulation of the following 
groups: Group I: artist, psychologist, architect, physician, dentist; Group II: mathematician, 
physicist, engineer, chemist; Group III: production manager; Group IV: aviator, farmer, 
carpenter, painter, mathematics-science teacher, policeman, forest service; Group V: YMCA 
physical director, personnel manager, YMCA secretary, social science teacher, school 
superintendent, minister; Group VI: musician; Group VII: certified public accountant; 
Group VIII: accountant, office worker, purchasing agent, banker; Group IX: sales manager, 
real estate salesman, life insurance salesman; Group X: advertising man, lawyer, 
author-journalist; Group XI: president of manufacturing corporation. Group scales are 
available for only Groups I, II, V, VIII, IX, and X. 

d A summary score on type of interest, obtained by a regrouping of items already included 
in scoring for the six fields of interest. 


Prediction of Occupational Choice and Job Satisfaction 


A number of research studies have indicated that persons who choose 
occupations consistent with their interest inventory scores tend to be more 
satisfied with their jobs than those who do not.32 Sarbin and Anderson33 
found that 82 percent of their men clients who expressed dissatisfaction 
with their work were engaged in occupations that were inconsistent with 
their primary interests. 

Strong’s follow-up studies indicated that men who remain in an occupational 
field score significantly higher in that field than men who change 
to another field. Moreover, men who change from one vocation to another 
tend to change into a field in which their interest scores are higher than 
for their earlier choice. Impressive predictive validity data have come from 
Strong's follow-up studies of the later occupational status of men who had 
taken the SVIB when they were college students. The higher a student's 
standard score in a scale, the greater the probability that he would enter 
the occupation suggested by that scale. Typically, the occupation in which 
a student was working ten or more years later ranked second or third 
for him among all the occupational scales of the SVIB.34 

32 Laurence Lipsett and James W. Wilson, “Do ‘Suitable’ Interests and Mental 
Ability Lead to Job Satisfaction?” Educational and Psychological Measurement, 
vol. 14 (Summer 1954), pp. 373-380. Dallas K. Perry, “Validities of Three Interest 
Keys for United States Navy Yeomen,” Journal of Applied Psychology, vol. 39 
(April 1955), pp. 134-138. 

33 T. R. Sarbin and H. C. Anderson, “Preliminary Study of the Relation of 
Measured Interest Patterns and Occupational Dissatisfactions,” Educational and 
Psychological Measurement, vol. 2 (January 1942), pp. 23-36. 

McCully35 did a similar follow-up study on men who had taken the 
Kuder in Veterans Administration counseling centers several years before. 
He found that persons engaged in accounting and related fields had had 
unusually high scores in the computational and clerical areas; those 
employed in engineering and related occupations had scored relatively 
high in scientific and mechanical; those engaged in high-level sales work 
had averaged high in the persuasive area; while those in mechanical repair- 
ing, electrical repairing, and bench crafts had scored highest in the 
mechanical area. 

Most of the studies on the predictive validity of interest inventories have 
been confined to students who enter the professions or skilled trades. 
Interest inventories probably have little predictive value for occupational 
choice among semiskilled and unskilled jobs. Studies of job satisfaction 
have indicated that only people from the higher socioeconomic levels tend 
to mention interest in their work as a source of job satisfaction. Those at 
the lower levels stress job security, economic returns, and recognition as a 
person. When Clark37 used Strong's approach in an attempt to develop 
interest keys for various trades, he found differential interest patterns 
within the skilled trades but not among the unskilled occupations. 


Prediction of Success in Vocations and Vocational Training Courses 


Disappointing results have been obtained with the use of interest inven- 
tory scores as predictors of success in vocational training. Air Force 
studies38 revealed that almost all r's between interest scores and grades in 
thirteen vocational training schools were below .20. Interest scores, however, 
do have predictive validity when the criterion is completion of the 
vocational training.39 


34 E. K. Strong, Vocational Interests of Men and Women (Stanford, Calif.: 
Stanford University Press, 1943). 

35 C. Harold McCully, "The Validity of the Kuder Preference Record." Unpublished 
doctoral dissertation, George Washington University, 1954. 

36 Darley and Hagenah, op. cit., pp. 8-10. 

37 Kenneth E. Clark, The Vocational Interests of Non-Professional Men (Minneapolis, 
Minn.: University of Minnesota Press, 1961). 

38 W. L. Layton, The Strong Vocational Interest Blank: Research and Uses 
(Minneapolis: University of Minnesota Press, 1960). 

39 Strong, 1943, op. cit., p. 524. 


In order to construct a vocational interest scale that had a fairly high 
correlation with success in a specific vocational field, one would have to 
select items that differentiated between successful and unsuccessful men 
within that vocational field. The rationale for constructing prediction tests 
is summarized in Table 4.5. SVIB scales, for example, were devised to 
differentiate between men in an occupation and professional men in gen- 
eral, rather than to differentiate between the successful and unsuccessful 
within an occupation. In other words, if the criterion of success in an occu- 
pation (rather than membership in an occupation) had been used as the 
basis for selecting items, the tests’ predictive validity for success criteria 
would undoubtedly have been increased. Such a procedure would be 
desirable if an interest inventory were to be used in selection. 
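
The difference between the two keying strategies lies only in which pair of groups is contrasted, as the following sketch suggests. It is an illustration under stated assumptions, not Strong's scoring method: the unit weights, the 20-point minimum difference, the function names, and the sample data are all hypothetical.

```python
def endorsement_rate(group, item):
    """Proportion of a group giving the 'like' response (coded 1) to an item."""
    return sum(person[item] for person in group) / len(group)


def keyed_items(criterion_group, reference_group, min_diff=0.20):
    """Key each item on which the two groups differ by at least min_diff.

    Membership keying: criterion = men in the occupation, reference = men in general.
    Success keying:    criterion = successful men, reference = unsuccessful men
                       within the same occupation.
    Returns a dict of item -> +1 or -1 (unit weights, an assumed simplification).
    """
    key = {}
    for item in criterion_group[0]:
        diff = (endorsement_rate(criterion_group, item)
                - endorsement_rate(reference_group, item))
        if abs(diff) >= min_diff:
            key[item] = 1 if diff > 0 else -1
    return key


# Hypothetical responses of salesmen classified by a success criterion.
successful = [{"likes_selling": 1, "likes_golf": 1},
              {"likes_selling": 1, "likes_golf": 0}]
unsuccessful = [{"likes_selling": 0, "likes_golf": 1},
                {"likes_selling": 0, "likes_golf": 0}]
success_key = keyed_items(successful, unsuccessful)
```

Here only the item that separates successful from unsuccessful salesmen enters the key. An item that both groups endorse equally contributes nothing to the prediction of success, whether or not it distinguishes salesmen from men in general.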

It seems logical that interest in one’s field of work would be highly cor- 
related with success in those occupations in which success depends largely 
on one’s commitment and enthusiasm. In one such vocation, insurance 
sales, Strong found that 56 percent of the men with A scores, as compared 
with only 6 percent of those with C scores, sold sufficient insurance to 
make an adequate income.40 In a study of newly employed insurance salesmen, 
Bills41 found a 76 percent failure rate among those with low scores 
on the life insurance salesman and real estate salesman scales; among 
those with high scores on these scales, the failure rate was only 22 percent. 
Other research studies have revealed positive relationships between relevant 
SVIB scores and success in engineering, advertising, and technical foremanship.42 
In the well-known follow-up study of gifted individuals by 
Terman and Oden,43 those rated as “least successful” included five times 
as many men with low interest scores in their current occupations as 
did the “most successful” group. 

A promising approach to the study of the predictive validity of interest 
inventories was used by Frederiksen and Melville,44 who studied the 
relationship of interest and ability test scores to achievement in engineering. 
They found that interest in engineering showed a negligible relationship 
to achievement for compulsive students,45 who work hard on tasks 
whether they are interested or not. However, for noncompulsive students, 
who work hard only when interested, interest scores were significantly 
correlated with achievement in engineering. While it is always unwise to 
generalize from a single research study, it seems reasonable that a person 
with adequate abilities but low interest can do well in a field but may not 
if he is noncompulsive. 

40 Ibid., pp. 487-488. 

41 M. A. Bills, “Relation of Strong's Interest Blank to Success in Selling Casualty 
Insurance," Journal of Applied Psychology, vol. 22 (December 1938), pp. 97-104. 

42 Ibid., pp. 501-504. 

43 L. M. Terman and M. H. Oden, The Gifted Child Grows Up (Stanford, 
Calif.: Stanford University Press, 1948). 

44 Norman Frederiksen and Donald S. Melville, “Differential Predictability in 
the Use of Test Scores," Educational and Psychological Measurement, vol. 14 
(Autumn 1954), pp. 647-656. 

45 The researchers used two indicators of compulsiveness: (1) having interests 
like those of professional accountants and (2) having relatively low speed of reading 
scores, in comparison with vocabulary scores on a standardized reading test. 
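
The idea of differential predictability reported by Frederiksen and Melville can be sketched by computing the interest-achievement correlation separately within each subgroup. The records below are invented to show the pattern and are not the researchers' data; the function names and the record layout are assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_by_subgroup(records):
    """records: (subgroup, interest score, achievement score) tuples.

    Returns the interest-achievement correlation within each subgroup.
    """
    groups = {}
    for subgroup, interest, achievement in records:
        xs, ys = groups.setdefault(subgroup, ([], []))
        xs.append(interest)
        ys.append(achievement)
    return {name: pearson_r(xs, ys) for name, (xs, ys) in groups.items()}


# Hypothetical engineering students, four in each subgroup.
records = [
    ("compulsive", 10, 70), ("compulsive", 20, 72),
    ("compulsive", 30, 69), ("compulsive", 40, 71),
    ("noncompulsive", 10, 55), ("noncompulsive", 20, 62),
    ("noncompulsive", 30, 70), ("noncompulsive", 40, 78),
]
r_within = r_by_subgroup(records)
```

With these made-up figures the correlation is zero for the compulsive group and nearly perfect for the noncompulsive group. A single coefficient computed on the combined group would conceal that contrast.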


INTERPRETATION OF INTEREST-INVENTORY RESULTS 

In Chapter 17, an illustration is given of the use of interest-inventory 
scores in counseling, in combination with aptitude test results and other 
relevant data. This section is concerned chiefly with the problems of inter- 
pretation that grow out of the limitations of interest inventories and the 
inadequacy of research data needed as a basis for valid inferences from test 
scores. 

The trained counselor recognizes the comparatively low relationships 
between interests and abilities, as well as the critical importance of con- 
sidering both ability and interest in evaluating vocational choices. In many 
school situations, however, various factors have led to an enthusiastic, 
uncritical use of interest inventories. Since interest-inventory results are 
not threatening to students, the inventories are often self-scored and the 
results graphed by students in a group situation. Although this process 
can be a helpful one, there is real danger of misinterpretation when interest 
profiles are analyzed by students without personal assistance from a well- 
trained guidance worker and in the absence of data about aptitudes and 
other limiting factors. Procedures are outlined in Chapter 17 to minimize 
the dangers of such misinterpretation. 

One need only read through an interest inventory to realize that many 
responses of a junior high school student may be invalid because of his 
lack of experience. The adolescent’s vocational interests are part of his 
changing, gradually emerging concept of self; they tend to become more 
stable and realistic as he matures. These factors do not rule out the use of 
interest inventories in junior high school as part of the process of helping 
young people in self-appraisal and stimulating them to consider a wider 
variety of vocational opportunities. They do, however, suggest caution in 
the interpretation of inventory results as indicators of suitable vocational 
choices. Many students at the junior high school level do not have well- 
defined patterns of interest that have implications for vocational choice. 
For such students, of course, interest-inventory results are of little value. 

The need for more adequate research data on the significance of interest-inventory 
results cannot be overemphasized. Until such time as data are 
available on the interest patterns of skilled and semiskilled workers, 
interest-inventory results will be of limited value for a large proportion 
of secondary school students.46 

More studies on the interrelationships among scores on the various 
inventories would be helpful to counselors. Counseling experience has 
indicated that apparent discrepancies between Kuder and 
Strong scores may be significant for vocational guidance.47 For example, 
some persons who had high persuasive scores on the Kuder inventory but 
low life-insurance-salesman scores on the Strong seemed, on the basis of 
interview and case-history material, to be interested in promotional activi- 
ties but to dislike activities in which they need to push people to the point 
of action, as in closing a sale. Research data supporting such hypotheses 
as this would help counselors to do a more professional job in the inter- 
pretation and use of test data. 

Although job applicants can and do fake results on interest inventories, 
it would seem that students seeking guidance would report frankly con- 
cerning their likes and dislikes. Even under these circumstances, however, 
a student's objectivity may be reduced by his desire to see himself in a 
preferred occupational role and to choose activities that are associated 
with that role in an occupational stereotype. When students are instructed 
to answer the Strong inventory so as to obtain high scores in a specified 
occupation, the majority do receive A ratings. Research findings indicate 
that the Strong inventory is more susceptible to upward faking, and the 
Kuder to downward faking.48 Through the use of verification scales, it is 
possible to identify examinees who try to fake, as well as those who succeed. 

Despite their limitations, interest inventories can be very helpful in 
aiding students to focus attention on one or two fields of choice, in stimu- 
lating them to study possible vocations, in suggesting occupations not 
previously considered and fields in which special aptitude tests may be 
desirable, in helping girls to clarify their thinking in the career-vs.-home 
decision, and in helping students to distinguish between genuine interests 
and expressed interests based on extrinsic pressures (parental aspirations, 
hero worship, and the like). 


46 Dr. Kenneth Clark has been systematically studying the problems involved 
in the measurement of interests at the lower occupational levels. Several years of 
research with enlisted men, conducted for the Office of Naval Research, has resulted 
in the development of an unpublished interest inventory, the Minnesota Vocational 
Interest Inventory. When this inventory is published, it should help in the counseling 
of students who are choosing among such skilled and semiskilled occupations 
as retail sales work, truck driving, and working in the various building trades. 

47 Donald E. Super, “The Kuder Preference Record in Vocational Diagnosis,” 
Journal of Consulting Psychology, vol. 11 (July-August 1947), pp. 184-193. 

48 H. P. Longstaff, “Fakability of the Strong Interest Blank and the Kuder 
Preference Record," Journal of Applied Psychology, vol. 32 (August 1948), pp. 
360-369. 


An invitation to discuss the results of an interest inventory often brings 
to the counselor's office students who need help on a variety of significant 
problems. In the course of his discussion regarding vocations, the student 
will naturally discuss any academic difficulties he may be having and will 
frequently bring in problems in his relationships with friends, family, or 
employer that he needs to discuss with his counselor. 


MEASUREMENT OF ATTITUDES 


Definitions of Terms 


There is no clear-cut distinction between (1) questionnaires regarding 
an examinee's likes, dislikes, and preferences, which are called interest 
inventories, and (2) questionnaires regarding his attitudes, which are 
typically organized into attitude scales. Greene49 makes the distinction that 
interest inventories are usually concerned with a person's feelings or pref- 
erences regarding personal activities, and the choices involved have no 
moral connotations; while attitude scales assess the person's position on a 
continuum of approval-disapproval toward social institutions, group activi- 
ties, and principles that affect the welfare of others. In this sense of the 
word, a person with a high interest in an occupation, such as engineering, 
is one who personally enjoys many of its activities, while a person who 
has a favorable attitude toward engineering may be one who values it as a 
significant, high-status occupation even though he might personally dislike 
the activities it involves. 

Attitudes may be defined as “predispositions to react negatively or positively 
in some degree toward an object, institution, or class of persons.”50 
For example, there are many ways in which a person might react negatively 
or positively toward the profession of engineering. A person with a 
high positive attitude toward the profession might encourage students 
to enter it, might listen respectfully to the views of engineers on civic 
problems, or might vote for a candidate for local office largely because 
he was an engineer. 


Various Approaches to the Study of Attitudes 


OBSERVING MANIFEST ATTITUDES Obviously, one approach to studying 
attitudes is to observe them as they are manifested in behavior. For example, 
science teachers who wish to develop in students the "scientific 
attitude" (a strong positive attitude toward the methods of science) could 
observe evidences of such an attitude in students’ laboratory work, group 
discussions, and other aspects of science study. Although it is doubtful 
that the teacher would obtain a sufficiently large sample of behaviors to 
justify reliable appraisal of this attitude in individual students, sufficient 
evidence might be obtained to justify conclusions regarding group progress. 

49 Edward B. Greene, Measurements of Human Behavior (New York: The 
Odyssey Press, 1952), p. 594. 

50 Jum C. Nunnally, Jr., Tests and Measurements: Assessment and Prediction 
(New York: McGraw-Hill Book Company, Inc., 1959), p. 300. 

In an attempt to appraise behavior changes in students as a result of 
science instruction, West developed several behavioral descriptions to 
be used by trained observers during instruction periods in science. The
descriptions for critical-mindedness and open-mindedness are quoted below: 


Critical-mindedness. Enter the code number 7b against the name of each 
pupil for evidences of critical-mindedness toward the class situation as shown 
by weighing evidence with respect to its pertinence, soundness, or adequacy. 
Examples are: Pupil asking for statement of source of information before 
accepting it; pupil verifying statements read or heard; pupil questioning author- 
ity constructively; pupil questioning truth of statement before he is willing to 
accept it as final. 

Open-mindedness. Enter the code number 1c against the name of each pupil
for evidence of being willing to abandon predetermined ideas in favor of ideas
which seem to be more nearly correct. Examples are: Pupil accepting good,
clear evidence without useless argument; pupil welcoming suggestions and
information about new undertakings; pupil accepting evidence which modifies
false beliefs; pupil willing to respect the viewpoint of others; pupil willing to
acknowledge his ignorance of the situation; pupil exhibiting conduct which
shows that he is free from prejudices; pupil taking criticism kindly and attempting
to profit by it.⁵¹


Teacher effectiveness in the observation of attitudes, as manifested in 
behavior, may be heightened by taking advantage of special situations 
in which student attitudes are more readily revealed and observed. Students 
at work in a laboratory situation or discussing the results of a teacher
demonstration are likely to offer evidences of their open-mindedness, habit
of suspending judgment, habit of looking for natural causes, and other
scientific attitudes. Similarly, attitudes toward civic participation might be
manifested as students work together in student-government activities. 

The teacher can set up special situations for the observation of attitudes,
for example, asking students to act out possible solutions to unfinished
stories⁵² or films, or assigning students to committee work on problems
that require teamwork for their solution.

51 J. Y. West, A Technique for Appraising Certain Observable Behavior of Children
in Science in Elementary School (New York: Bureau of Publications, Teachers
College, Columbia University, 1937), p. 16.

250 THE STUDY OF INDIVIDUALS


STUDYING EXPRESSED ATTITUDES The preceding examples have in- 
volved manifest attitudes, as evidenced in student behavior. The approaches 
were similar to those used in the study of manifest interests, that is, observ- 
ing and recording relevant behaviors of the student. Two other approaches, 
each paralleling an approach used in the study of interests, are (1) to 
obtain direct expressions of attitudes, and (2) to develop questionnaires 
or inventories of attitudes, in which the examinee reacts to a large sampling 
of verbal statements. Just as in the study of interests, we find that the 
latter approach (because of the larger sampling and the minimizing of 
the effect of stereotypes) tends to yield more reliable results than request- 
ing direct expressions of over-all attitudes toward an institution or class 
of people. That is, summating students' reactions to a series of selected 
statements regarding church, war, or a specific ethnic group usually gives 
more consistent results (from one time to another) than requesting an 
over-all expression of attitude. 

Attitude inventories are similar to interest inventories in at least two 
ways: (1) they are self-report questionnaires in which the examinee can 
usually modify his responses if he desires to do so; (2) they contain a 
sampling of verbal statements to which the examinee gives responses. However,
attitude inventories differ from interest inventories in the way in
which items are selected and scored. The items on attitude inventories
are selected to represent a universe of attitudes toward the object, institu- 
tion, or class of persons. They are not selected for their correlation with 
some external criterion. They are used for the first and fourth purposes, 
rather than the second and third purposes, outlined in the discussion of 
validity in Chapter 4. Hence, their content and construct validity are 
usually more important than their concurrent or predictive validity. 

In social studies classes, or any other situations in which a variety of 
defensible attitudes might be held, the teacher must not grade or other- 
wise evaluate students on the basis of attitudes they hold. The teacher's 
legitimate concern is with the student's ability to formulate and defend 
his own attitudes with respect to labor-management or other social studies 
problems, rather than with the specific attitudes held. When we study 


52 See George Shaftel and Fannie R. Shaftel, Role Playing the Problem Story 
(New York: National Conference of Christians and Jews, 1952). The use of 
reaction stories is discussed further in Chapter 8 of this textbook. 

53 An exception would be an inventory of study attitudes, such as the Survey
of Study Habits and Attitudes by Brown and Holtzman, which would be expected
to have predictive validity for the prediction of grade-point average in academic
work. Publication data for this and other inventories are given in Appendix A.




attitudes in an area in which competent judges would defend different 
points of view, we must describe attitudes rather than evaluate them. If 
the teacher imposes a right vs. wrong type of evaluation in such an area,
"expected answers" will be given, and the entire procedure becomes
indefensible. 

Since appraisal of individual attitudes as a basis for grading is inappro- 
priate, published attitude questionnaires have been used chiefly in research 
work. They may also be appropriately used for checking on shifts of 
attitude in classes or groups as a basis for evaluating the effectiveness of 
instruction, provided that they are administered anonymously. For ex- 
ample, a teacher might wish to know how the group attitude had changed 
toward some health practice, some foreign country, or the United Nations, 
as the result of a unit of work that included reading and discussion 
concerning any of these areas.

If student responses to an attitude questionnaire can be summarized in 
such a way as to locate each examinee at some point on a continuum from 
“strongly negative attitude” through “neutral” to “strongly positive attitude,” 
the questionnaire can be called an attitude scale. To develop an attitude 
scale, a series of statements representing different degrees of positive and
negative attitudes must be formulated. In order to achieve content validity,
the statements might well be based on the results of interviewing a representative
sampling of subjects or analyzing statements of attitude written
by such a representative group.


THE THURSTONE METHOD OF ATTITUDE-SCALE CONSTRUCTION An early
method of constructing attitude scales, which is still widely used, was
developed by Thurstone and his associates.⁵⁴ After a large number of
statements (which express various degrees of positive and negative feeling
toward some institution or group) are obtained, each statement is reproduced
on a card or slip of paper. Then, a large number of judges independently
sort these slips according to their position on an 11-point
continuum (ranging from "extremely favorable" through "neutral" to
"extremely unfavorable"). The judges do not give their personal reactions
to the statements but arrange them on a continuum of intensity of positive
and negative feeling.

Items that are assigned a variety of values by the judges are eliminated
as ambiguous or as unrelated to the attitude being judged. Only those
statements that show relatively low interjudge variability are retained.
From among these statements, there would be selected 15-25 statements 
that were fairly evenly spaced, with respect to median rating, on the 


54 L. L. Thurstone and E. J. Chave, The Measurement of Attitude (Chicago:
University of Chicago Press, 1929). 




attitude continuum. For example, on an 11-point scale, items might be 
selected with median intensity values of 0, 0.5, 1.0, 1.5, and the like, 
through 11.

Once the statements have been selected, they are arranged in random 
order on the printed form. The student then marks those statements with
which he agrees; and his score is the median intensity value of statements 
he has marked. Scales of the Thurstone type have been constructed for 
attitudes toward church, war, censorship, capital punishment, and many 
other institutions and issues, as well as for attitudes toward a number of
ethnic groups. The scales are quickly administered and easily scored.
The method of scaling is objective and reliable because many independent 
judgments are used. 
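
The selection and scoring steps just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a reproduction of Thurstone's own computations: the judges' ratings are invented, the interquartile range is assumed as the index of interjudge variability, the cutoff of 2.0 is arbitrary, the numbering of the continuum (11 as most favorable) is an assumption, and all function names are hypothetical.

```python
from statistics import median, quantiles


def scale_value_and_spread(judge_ratings):
    """Return (median placement, interquartile range) for one statement."""
    q1, _, q3 = quantiles(judge_ratings, n=4)
    return median(judge_ratings), q3 - q1


def retain_statements(ratings_by_statement, max_spread):
    """Keep statements on which judges largely agree.

    Returns a dict mapping each retained statement to its scale value
    (the median of the judges' placements).
    """
    kept = {}
    for statement, ratings in ratings_by_statement.items():
        value, spread = scale_value_and_spread(ratings)
        if spread <= max_spread:  # low interjudge variability
            kept[statement] = value
    return kept


def thurstone_score(scale_values, endorsed):
    """Score = median scale value of the statements the student marked."""
    return median(scale_values[s] for s in endorsed)


# Invented placements by eight judges on an assumed 1..11 continuum.
judge_ratings = {
    "A": [10, 10, 11, 10, 9, 10, 11, 10],
    "B": [6, 6, 5, 6, 7, 6, 6, 5],
    "C": [1, 6, 11, 3, 9, 2, 10, 5],   # judges disagree widely
    "D": [2, 2, 1, 2, 3, 2, 2, 1],
}

scale_values = retain_statements(judge_ratings, max_spread=2.0)  # arbitrary cutoff
score = thurstone_score(scale_values, endorsed=["A", "B"])
```

With these invented ratings, statement C is discarded because the judges scatter it across the whole continuum, and a student who marks statements A and B receives the median of their two scale values.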


THE LIKERT METHOD OF ATTITUDE-SCALE CONSTRUCTION The first 
step in the Likert method is also the collection of a large number of state- 
ments expressing various degrees of positive and negative feelings about 
an object, institution, or class of persons. The selection of items for the 
attitude scale, however, does not involve the use of judges; rather, the 
selection is based on the results of administering the items to a representa- 
tive group of subjects. Each item is rated by subjects taking the attitude 
scale on a five-point continuum from "strongly approve" to "strongly 
disapprove.” The total score is the sum of all the item scores.⁵⁵ The validity
of each item is studied, with the criterion being total score on the pre- 
liminary edition. Only those items that have high correlations with total 
score are retained. Items that have low correlations are excluded as either 
unreliable or as measuring some extraneous attitude factor. As a result, 
the shorter revised attitude scale is much more homogeneous than the 
preliminary edition. It has greater internal consistency, a characteristic 
which is necessary, but not sufficient, for construct validity. 
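
The Likert item analysis described above can be illustrated with a short Python sketch. It is offered only as an example under stated assumptions: the responses are invented, responses are coded 1 through 5 with 5 as the most favorable end for favorable items, the item is left in the total when the correlation is computed, and the function names are hypothetical. No particular cutoff for a "high" correlation is implied by the text.

```python
from math import sqrt


def score_item(response, favorable):
    """Score a 1..5 response; reverse the scoring for unfavorable statements."""
    return response if favorable else 6 - response


def pearson(x, y):
    """Pearson correlation of two equal-length lists.

    Raises ZeroDivisionError if either list has no variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_analysis(responses, favorable):
    """Return (total scores, item-total correlations) for a preliminary form."""
    scored = [[score_item(r, f) for r, f in zip(row, favorable)]
              for row in responses]
    totals = [sum(row) for row in scored]
    correlations = [pearson([row[j] for row in scored], totals)
                    for j in range(len(favorable))]
    return totals, correlations


# Invented data: six subjects, four items; the second item is unfavorable.
favorable = [True, False, True, True]
responses = [
    [5, 1, 5, 3],
    [4, 2, 4, 1],
    [3, 3, 3, 5],
    [2, 4, 2, 2],
    [1, 5, 1, 4],
    [4, 1, 5, 3],
]

totals, correlations = item_analysis(responses, favorable)
```

In this made-up sample the first three items correlate highly with the total score, while the fourth barely correlates with it and would be a candidate for exclusion from the revised scale.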

The advantages of the Likert method include (1) greater ease of preparation;
(2) the fact that the method is based entirely on empirical data
regarding subjects’ responses rather than subjective opinions of judges; 
(3) the fact that this method produces more homogeneous scales and 
increases the probability that a unitary attitude is being measured; and 
(4) the scales provide more information about the subject's attitudes, 
since an intensity reaction is given to each of many items. The chief dis- 
advantage of the Likert method is that the scores are relative to the group 
used in scale construction, whereas the Thurstone method establishes a 
neutral point as a basis of reference. If Likert-type scales were normed 


55 Each favorable item is scored by giving five points for "strongly agree," 
four points for "agree," three for "uncertain," and the like. The scoring is reversed 
for unfavorable statements. 




on representative groups, the scores could be more adequately interpreted. 
It is doubtful, however, that such norming will be done. Attitude scales 
are used chiefly in research studies, and most researchers find that an 
attitude scale devised to suit their specific purpose is more suitable than 
any of the published scales.⁵⁶

The Thurstone and Likert-type scales for the same institution or group 
tend to yield results that agree or intercorrelate highly. Reliability coef- 
ficients for such scales tend to be in the .80's and are highly satisfactory 
for group comparisons. Recent studies on attitude-scale construction evi- 
dence a trend toward the development of homogeneous subscales within 
a larger area, for example, toward aspects of an institution such as the 
church or the United Nations. 

The chief criticism that might be leveled at attitude scales is concerned 
with the indirectness of measurement, that is, verbal statements are used 
as a basis for inferences about “real attitudes.” Moreover, attitude scales 
are easily faked. Although administering the scales anonymously may 
increase the validity of results, anonymity makes it difficult to correlate 
the findings with related data about the individuals unless such data are 
obtained at the same time. It seems that we must limit our inferences from 
attitude-scale scores, recognizing that such scores merely summarize the 
verbalized attitudes that the subjects are willing to express in a specific
test situation. The student will recognize the difficulty of studying the con- 
current validity of verbal attitude scales by studying their relationship 
with behavioral criteria. Because of many factors, persons with the same 
attitudes will manifest different behaviors. 


SUMMARY STATEMENT 


Four types of data concerning the interests of students can be obtained: 
(1) expressed (verbal) interest in specific occupations or activities; (2) mani- 
fest interest, as evidenced by the student's actual participation in some activity; 
(3) tested interest, as reflected in the student's informational background and 
vocabulary in special interest fields; and (4) inventoried interest, as summarized
in a pattern of scores indicating the student's preferences for types
or areas of activity. Almost all interest measures developed and standardized
for school use are of the fourth type. 

Although interests and abilities tend to be related, students must not be 
advised to enter occupations in which they show a high interest level unless 
data indicate that the students also have the abilities requisite for success. 


56 The research worker who wishes to develop his own scales will find Edwards'
Manual a valuable source of information. This manual illustrates step-by-step procedures
for the Thurstone and Likert approaches, as well as the more recently
developed Guttman approach. Allen L. Edwards, Techniques of Attitude Scale
Construction (New York: The Psychological Corporation, 1957).




Research data concerning the stability of students’ vocational interests indi- 
cate that the interests of young adults (that is, those in the upper college years) 
are much more stable than those of younger students. For this reason, Strong 
does not recommend the use of his inventories with students below age 17. The 
relative instability of the vocational interests of younger students does not rule 
out the use of interest inventories in the ninth or tenth grades, but it does imply 
the need for greater caution in the interpretation and use of results. 

Two very different approaches have been used in the development of the 
leading interest inventories. A student's scores on the Strong inventories indicate 
the extent to which his expressions of like or dislike for specific activities, 
school subjects, and so on are similar to those of employed workers in each 
of more than thirty occupational groups. That is, a student with an A rating 
on the Actor scale has interests that agree much more closely with those of 
actors than with those of "men in general.” The Kuder Preference Record 
requires the student to indicate his preferences within groups of three activities; 
such indications of preference are then summarized according to the interest 
areas (clerical, artistic, and the like) which they represent. 

Although different approaches have been used in the construction of interest 
inventories, an examination of five of the leading measures (Table 7.2) reveals 
considerable agreement on the basic interest groups. On the basis of their 
analysis, Super and Crites proposed the following eight basic interest groups: 
scientific, social welfare, literary, material (concrete), systematic (record- 
keeping), contact, aesthetic expression, and aesthetic interpretation. 

As a basis for more discriminating use of interest-inventory results in coun- 
seling, a number of cautions concerning the interpretation of interest scores 
were formulated: (1) Interest inventory results must not be interpreted as 
indicators of occupations in which students will be successful. (2) The relative 
instability of vocational interests among younger students suggests the need 
for great caution in the interpretation of interest scores for junior high school 
students. (3) A large number of students do not have well-defined patterns of 
interest that have implications for vocational choice. (4) Until further research 
reveals more well-defined interest patterns for semiskilled workers and for 
workers in the leading occupations for women, interest-inventory results will 
be of limited value for a large percentage of high school students. 

Techniques of studying attitudes manifested in student behavior were briefly 
reviewed, as well as two leading methods of attitude-scale construction. 


SELECTED REFERENCES 


BERDIE, RALPH F., "Strong Vocational Interest Blank Scores of High School 
Seniors and Their Later Occupational Entry," Journal of Applied Psychology,
vol. 44 (June 1960), pp. 161-165.

CRAVEN, E. C., The Use of Interest Inventories in Counseling. Professional Guidance
Series. Chicago: Science Research Associates, 1961.

DARLEY, J. G., Clinical Aspects and Interpretation of the Strong Vocational 
Interest Blank. New York: The Psychological Corporation, 1941. 

DARLEY, J. G., AND THEDA HAGENAH, Vocational Interest Measurement: Theory and
Practice. Minneapolis, Minn.: University of Minnesota Press, 1955.




DURNALL, EDWARD J., JR., “Falsification of Interest Patterns on the Kuder Pref- 
erence Record," Journal of Educational Psychology, vol. 45 (April 1954), 
pp. 240-243. 

GUILFORD, J. P., AND OTHERS, "A Factor Analysis Study of Human Interests,"
Psychological Monographs, vol. 68, No. 4. Washington, D. C.: American 
Psychological Association, 1954. 

KATZ, MARTIN R., "Interpreting Kuder Preference Record-Vocational Scores: 
Ipsative or Normative?" The Vocational Guidance Quarterly, vol. 10 
(Winter 1962), pp. 96-100. 

KUDER, G. FREDERIC, AND BLANCHE B. PAULSON, Exploring Children's Interests.
Chicago: Science Research Associates, 1951. 

LAYTON, WILBUR L., Counseling Use of the Strong Vocational Interest Blank. 
Minnesota Studies in Student Personnel Work No. 9. Minneapolis, Minn.: 
University of Minnesota Press, 1958. 

LAYTON, WILBUR L., The Strong Vocational Interest Blank: Research and Uses. Minneapolis,
Minn.: University of Minnesota Press, 1960. 

LONGSTAFF, H. P., “Fakability of the Strong Interest Blank and the Kuder Preference
Record," Journal of Applied Psychology, vol. 32 (August 1948),
pp. 360-369.

MCARTHUR, CHARLES, "Long-Term Validity of the Strong Interest Test in Two 
Subcultures," Journal of Applied Psychology, vol. 38 (October 1954), pp. 
346-353. 

MCARTHUR, CHARLES, AND LUCIA BETH STEVENS, “The Validation of Expressed Interests as
Compared with Inventoried Interests; a Fourteen Year Follow-Up,”
Journal of Applied Psychology, vol. 39 (June 1955), pp. 184-189.

MALLINSON, GEORGE G., AND WILLIAM M. CRUMBINE, “An Investigation of the 
Stability of Interests of High School Students,” Journal of Educational 
Research, vol. 45 (January 1952), pp. 369-383. 

REMMERS, H. H., Introduction to Opinion and Attitude Measurement. New 
York: Harper & Row, Publishers, Inc., 1955.

STEPHENSON, RICHARD R., "A New Pattern Analysis Technique for the SVIB,"
Journal of Counseling Psychology, vol. 8 (Winter 1961), pp. 355-362. 

STRONG, EDWARD K., JR., Vocational Interests 18 Years After College. Minneapolis,
Minn.: University of Minnesota Press, 1955.

STRONG, EDWARD K., JR., Vocational Interests of Men and Women. Stanford, Calif.: Stanford
University Press, 1943.

SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness by Means
of Psychological Tests. New York: Harper & Row, Publishers, Inc., 1962,
Chapters 16-18.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Outline a talk to be presented to a group guidance class before students 
take and self-score an interest inventory. Emphasize the values and limitations 
of a specific interest inventory, other sources of data on interests, the relation- 
ship between interests and aptitudes, and the like. 

2. Obtain interest-inventory data for two or three high school students. Draw 
the test profiles for each student. Record any data given on the cumulative 




record regarding expressed interests (recorded vocational choices) and manifest 
interests (choice of elective subjects, participation in extracurricular activities, 
and the like). Summarize all data on interests and indicate three or four occu- 
pations suggested by the interest pattern and general intelligence level of each 
student. What aptitude tests would you like to administer to each of these stu- 
dents to obtain necessary additional information? 

3. What problems are involved in the use of interest inventories in an employ- 
ment situation? Examine a specific interest inventory to see how applicants 
could intentionally misrepresent their pattern of interests. Consult the Buros' 
Mental Measurements Yearbook, as well as research studies on the extent to 
which responses can be faked on this inventory. 

4. Compare the methods used in constructing and validating the Strong and
the Kuder interest inventories.

5. Of what value are level-of-interest scores, such as those obtained on the
Occupational Interest Inventory?

6. What are the advantages and disadvantages of using the names of occupations
as items in an interest inventory?

7. What informal methods could you use to obtain evidence as a basis for
judging students' interests in your subject field?

8. What are some of the problems involved in the testing of attitudes? What
procedures seem to yield the best results?

9. Observe students in a discussion group and make anecdotal recordings
concerning the social attitudes evidenced in their discussion.

10. Prepare a list of behavioral evidences of the scientific attitude in the subject
area in which you will teach.


8 Informal Methods of Studying Personal-Social Adjustment


As the reader studies this and the succeeding chapter, it will become in- 
creasingly evident that he should not expect to have as much success in 
evaluating students’ personal-social adjustment as in evaluating their scholastic
aptitude or their skills in reading and arithmetic. Not only are the
techniques of evaluation less well developed, but the most valid techniques 
are available only to the clinical psychologist or the psychiatrist. Effective 
use of personality inventories and projective techniques (discussed in 
Chapter 9) demands more training, more time, and a greater insight into 
individual cases of maladjustment than the classroom teacher can be 
expected to achieve. 

Why, then, should teachers and counselors attempt evaluation in this 
difficult area? The answer lies largely in the importance of the child’s 
personal-social adjustment to every aspect of his development. The student's
mental health affects his ability to learn, his interest in learning, his
ability to contribute to classroom experiences, and his later success as a
citizen, an employee, and a parent. Attempts to help the student who is
retarded in reading or any other area of school work often reveal that
the student's learning difficulties are inextricably related to his problems 
of personal-social adjustment. 

Teachers are able to observe a student’s reactions in a wide variety of 
situations that reveal his feelings and attitudes: situations that involve
social participation, rivalry, outside authority, success and failure, praise
and blame. They are in a unique position to identify and refer to guidance
specialists those students who are unusually tense or withdrawn, or show
other evidences of poor mental health.

The results of any one technique of evaluating student adjustment may
have low reliability and validity. The teacher, however, has an opportunity
to study many samples of student behavior and to note their consistency;
to make hypotheses concerning possible causal factors contributing
to poor adjustment; and then to examine these hypotheses in the light
of additional information.


THE NATURE OF PERSONAL-SOCIAL ADJUSTMENT 


Teachers and counselors want to contribute to the mental health of 
students, rather than to help them develop any specified pattern of per- 
sonality traits. Defining good mental health, or good personal-social adjust- 
ment, however, is not as simple as the defining of many other educational 
goals. We shall begin with a brief examination of three major concepts in 
this area. These concepts of maturity, normality, and adjustment will 
be found to overlap. Each of them, however, contributes to an under- 
standing of the nature of personal-social adjustment. 


Maturity 


The concept of maturity is especially significant to those who work with 
growing children. Every teacher has students in his classes who may best 
be described as relatively immature in certain respects, whose behavior is 
characteristic of a younger age group. A second-grade child who continually
interrupts the teacher, a fourth-grader who tattles, or a junior
high school student who frequently complains to the teacher about other 
students is not functioning at the expected level of maturity. Buhler and 
others use the term “average developmental expectation" for such “age 
norms." They point out that teachers have difficulty with students who 
have not developed "school maturity” in the sense of willingness and 
ability to carry out work assignments at the sacrifice of immediate 
impulse satisfactions.¹

The term “average developmental expectation” implies that a student’s 
maturity should be judged in comparison with others of his own age and 
culture group. Children cannot be judged in terms of their progress toward 
adult behavior. Docile children can learn to display behavior typical of 
much older children or of adults, but such docility should not be mistaken 
for genuine maturity. In fact, in the preadolescent or adolescent years, 
such conforming behavior may indicate immaturity and prolonged dependence
upon adult approval.

A child whose behavior is fairly typical of children of his age is con- 
sidered to be at a satisfactory level of maturity; a child whose behavior 


1 Charlotte Buhler, Faith Smitter, and Sybil Richardson, Childhood Problems
and the Teacher (New York: Holt, Rinehart and Winston, Inc., 1952), p. 43.


Studying Personal-Social Adjustment 259 


is like that of much younger children is considered immature. It is dif- 
ficult to describe an above-average degree of maturity in children without 
including in the "superior" group those whose pseudomaturity may be 
achieved at the cost of internal conflict as well as less social acceptance 
within their own age group. 


Normality 


In one sense of the word, "normal" implies average or typical. Statisticians
use the word in that sense. In many aspects of personal-social adjustment, 
behavior is perhaps best described as being similar to, or deviating from, 
average behavior. It is normal, for example, for children to become rest- 
less after a long period of drill work or for adolescents to show less 
respect toward adults than do younger children. No value judgment is 
implied in this use of the term “normal.” 

Deviations from average behavior are signals to the teacher that a 
student should be observed more carefully. If a student is unusually tired 
after physical exercise, unusually tense when called upon to recite, un- 
usually excited when he is given a small role in the class play, or unusually 
depressed when he is criticized, the teacher will want to discover the reasons 
behind this deviating behavior.

It is normal for children to have problems. When an individual is unable 
to cope successfully with his problems, he may show abnormal behavior— 
not only in a statistical sense but in the sense of its being inappropriate,
ineffectual, unwholesome, or self-defeating. In this sense of the word, 
“normality” is used with the meaning of “health” vs. “illness.” “Abnormality”
implies malfunctioning behavior; that is, behavior that is inappropriate
and ineffective in achieving its purpose.


Adjustment 

The term “adjustment” is most frequently used to describe how well a 
person gets along in a situation. The employer will point out well-adjusted 
employees, the teacher well-adjusted children, in terms of their conformity 
to the environmental demands of the work or school situation. To a 
psychologist, however, adjustment implies not mere conformity but a
harmonious relationship between the individual and his present environment.
A person can achieve adjustment either by adapting his behavior
to the requirements of a situation or by changing the situation to meet
his personality needs.

Students may show different degrees of effectiveness in adjusting to 
different areas of living. A student’s adjustment cannot be judged entirely 
on the basis of how he acts in school. For some students, school is the 




only area of serious maladjustment; for others, school may be the area of 
best adjustment, bringing them satisfactions that they cannot achieve in 
other areas of living. 


ADAPTIVE BEHAVIOR AND INTERNAL CONFLICT Redl and Wattenberg 
warn that, since adjustment describes a relationship between an individual 
and his environment, “adjustment is a reasonable criterion of mental health 
only if the demands of a situation are reasonable.”² One would not wish
to apply the term "well adjusted" to the child who adjusts to the code of 
a gang of child thieves or the adult who adjusts to the requirements of the 
Nazi regime. Children adapt to the constantly nagging parent or teacher 
by developing psychological deafness; to the tyrannical parent or teacher 
by various types of aggression or by passive resistance. These reactions 
indicate fairly normal adaptive devices—and a need for change in the 
environmental situation. Such adaptive behavior may be essential for 
the person's "survival" in psychological terms. 

Redl and Wattenberg stress that adjustment is concerned with much 
more than the harmony between a child's surface behavior and environ- 
mental demands. 


Are the things he is doing in harmony with his own feelings? If a person is 
torn by deep, unresolved conflicts, no matter how docile the behavior, he can- 
not be considered well-adjusted. Under pressure from home, for example, 
some youngsters will do phenomenal school work, but become sullen and 
irritable in the process. . . . A child's happiness is an important clue.³


In other words, an adjusted person “is able to work out good relationships
in environments that are in harmony with his own values and do not make
unreasonable demands upon him.”⁴


THREE TYPES OF FAILURE TO ADJUST TO ENVIRONMENTAL DEMANDS 
Psychologists have frequently studied the problem of personality through 
observation of abnormal behavior. If the concept of adjustment is viewed 
from the negative point of view, it is seen that poorly adjusted indi- 
viduals may be grouped into three general categories, described by 
Cattell as follows: 


Psychotics escape right out of the culture pattern into unreality. Neurotics 
try hard to conform but do so at the cost of ruinous internal mental conflict. 
Delinquents prefer to have the conflict between themselves and society. Neurotics 
and delinquents have in common an incapacity to take the cultural 
pattern. . . . They differ in that the delinquent is generally under-inhibited and 
the neurotic over-inhibited.⁵ 


2 Fritz Redl and William W. Wattenberg, Mental Hygiene in Teaching (New 
York: Harcourt, Brace & World, Inc., 1959), p. 169. 

3 Ibid., p. 170. 

4 Ibid., p. 169. 


The neurotic youth has tended to inhibit the impulses that arise from 
his failure to work out a harmonious adjustment with his culture; the 
delinquent (sometimes as a result of low intelligence, undesirable neigh- 
borhood influences, or failure to develop an effective conscience through 
early identification with a parent or other adult) satisfies his impulses in 
ways that bring him into conflict with society. 


A WORKING CONCEPT OF PERSONAL-SOCIAL ADJUSTMENT On the basis 
of the review of concepts just presented, the following working concept of 
personal-social adjustment is presented as an orientation or frame of reference 
for Chapters 8 and 9. 

A person may be described as having good personal-social adjustment 
who 


1. Establishes reasonably harmonious relationships with others in different en- 
vironmental settings (in home, school, and community) without developing 
persistent internal conflicts that make him unhappy, dissipate his energy 
through nervous tension, or result in ineffective behavior. 

2. Is able to devote most of his energy to the satisfaction of purposes or goals 
that he accepts as worthy and that are accepted as such by his culture. 

3. Shows a degree of control of emotions and impulses that is typical of his 
age group; retains the spontaneity, creativity, and willingness to experiment 
and explore that are essential to further growth; is increasingly able to work 
toward more remote goals and to accept guidance from others without 
servility, evasion, or resentment. 

4. Conforms sufficiently well to the standards or codes of his own age, sex, and 
culture groups as to allow himself to achieve a sense of belongingness and 
to be accepted by these groups. 


PERSONALITY DESCRIPTION 


For some purposes, such as in vocational guidance, we are not merely 
concerned with identifying poorly adjusted children. We are interested 
in describing the individual's personality. 

If we are going to attempt personality description, there are two possible 
approaches: (1) the psychometric approach, in which we attempt to 
obtain a numerical estimate of each of several dimensions or traits of 
personality for each individual, and (2) a more impressionistic or clinical 
approach, in which we use observation, interviews, and other techniques 
to obtain clues concerning the individual's needs, problems, and conflicts, 
and integrate all this evidence into a composite, integrated picture of the 
person as a whole. 


5 Raymond B. Cattell, An Introduction to Personality Study (London: Hutchinson’s 
University Library, 1950), p. 182. 

In the area of personality study, the educator or psychologist com- 
mitted to the psychometric approach tends to administer a standard set 
of questions, called a personality questionnaire or inventory. He prefers 
to ask everyone the same questions; and he prefers selection-type ques- 
tions to those allowing free response. He is especially pleased if he can 
compare a person's responses with those made by persons in a norming 
sample, who have answered the questionnaire under similar conditions. 
He wants to interpret the individual scores he obtains in terms of the 
findings of well-designed validation studies. 

Although his methods seem very scientific, the psychometrician tends 
to select for study only those aspects of personality on which one can 
develop objectively scored questions. He tends to disregard significant 
aspects of personality that elude precise definition. His defense would be 
that what cannot be defined cannot be measured; that a technique or 
instrument on which results vary with the examiner or test-user does not 
provide admissible evidence. 

A psychologist who prefers the clinical approach to personality study 
will prefer free-response questionnaires and test situations in which the 
person has an opportunity to interpret the question or task as he per- 
ceives it, and respond in an individualized manner. The clinical approach 
admittedly results in data that are not comparable from one person to 
another and that must be subjectively interpreted. However, the clinical 
approach can lead to usable hypotheses when the leads produced from a 
variety of sources tend to reinforce each other and when additional data 
are obtained in areas in which the findings do not converge. 

In contrasting the psychometric and clinical approaches to the study 
of individuals, Cronbach makes a helpful analogy to concepts in 
“information theory.” 


He [the information theory specialist] distinguishes two attributes of any 
communication system: bandwidth and fidelity. . . . 

The classical psychometric ideal is the instrument with high fidelity and low 
bandwidth. . . . A college aptitude test tries to answer just one question with 
great accuracy. . . . 

At the opposite extreme, the interview and the projective technique have 
almost unlimited bandwidth . . . the interviewer may cover twenty topics in a 
half-hour, and note an even larger number of traits. . . . 

Bandwidth can be greatly increased when it is possible to confirm or reverse 
judgments at a later time . . . Narrowband instruments are desired to make 
final, irreversible decisions about important matters (e.g. scholarship awards). 
. . . As a first stage, the wideband test scans superficially a range of important 
variables, pointing out significant possibilities for further study. In this use the 
wideband procedure is used for hypothesis formation, not for final decisions. 
. . . The fallibility of wideband procedures does no harm unless the hypotheses 
and suggestions they offer are regarded as verified conclusions about the 
individual.⁶ [Italics added] 


General observation of individuals, interviews that range widely over 
any topics of concern to the student, and projective tests in which persons 
give individualized responses to ambiguous pictures or to inkblots have 
wider bandwidth and lower fidelity. 


SOURCES OF DATA ABOUT THE PERSONAL-SOCIAL 
ADJUSTMENT OF INDIVIDUALS 


Self-report 


In the preceding chapter on interests and attitudes, we found that the 
most widely used methods involved self-report instruments. In both interest 
inventories and attitude scales, the test-user summarizes and interprets 
what the individual has voluntarily reported about himself. In studying 
personal-social adjustment we also tend to rely on self-report. This appears 
to be an ideal approach because each person knows more about his fears 
and hopes, his feelings of adequacy or inferiority, than his teacher or his 
peers could possibly know. The interview, the autobiography, and the 
personality inventory are attempts to tap this source of information. The 
interview and autobiography will be discussed in this chapter, while the 
advantages and limitations of personality inventories are considered in 
Chapter 9. 


Observation of Relevant Behavior 


Another approach is to study behavior relevant to the individual’s 
personal-social adjustment. A person may say that he is brave but behave 
in a cowardly manner; he may say that he is at ease in a social group and 
yet show obvious signs of tension when observed in a social gathering. 

If one could obtain a large, representative sampling of relevant behavior, 
undistorted by the effects of being observed, one would probably 
have a more valid basis for judging a person’s fearfulness or social poise 
than could ordinarily be obtained from a self-report inventory. However, 
as we stressed in the discussion of indirect measurement in Chapter 5, 
many significant behaviors are not readily observed and do not recur frequently. 
Moreover, the natural or real-life situations in which we observe 
individual A are not comparable to those in which we observe individual B. 


6 Lee J. Cronbach, Essentials of Psychological Testing (New York: Harper & 
Row, Publishers, Inc., 1960), pp. 602-604. 

When we attempt to set up special situations to evoke from each person 
the behavior we wish to observe (such as a standard situation in which 
each person is purposely frustrated), we gain in comparability of situa- 
tions; however, the situation may become artificial and the behavior of 
the person be modified thereby. 

Teachers, however, have unusually fine opportunities for observing stu- 
dents in a variety of natural situations—in the classroom, on the play- 
ground, and in extracurricular activities. Observation in natural situations 
involves greater bandwidth with corresponding loss of fidelity. However, 
as already emphasized, if such techniques are used as a basis for hypotheses 
that are subject to revision, they can prove very useful. Suggestions for 
the use of observation in both natural and specially designed situations 
will be considered in this chapter. 


Projective Techniques 


Projective tests provide rich opportunities for observing behavior; they 
are especially designed to stimulate the person to behave so as to reveal 
more of his "real self," of his partially repressed fears and hopes, than 
he would if he were on guard. The examiner tries to stimulate his imagination 
and to encourage him to make free, uncensored responses. A child 
may be asked to act out dramatic scenes of his own choosing with dolls 
and miniature props; or a person may be asked to tell a story about each 
of several pictures. These pictures have been selected as having psycho- 
logical significance but ambiguous content, that is, pictures that can be 
interpreted in any one of several ways. 

Observing and interpreting an individual's behavior in projective test 
situations (which stimulate his imagination and reduce his censorship of 
his responses) requires special training, not only in test administration 
but in the background of psychological theory required for the formula- 
tion and testing of hypotheses. Special courses on these procedures are 
provided for persons preparing to be clinical psychologists or to work in 
related fields. Projective techniques will be briefly described in Chapter 9. 


The Opinions of Others 


As a basis for making certain types of judgments, the opinions of others 
are necessary. In the selection of students or employees, one cannot 
depend entirely on self-report techniques; ratings by teachers or employers 
on relevant characteristics seem to be essential. Teachers usually wish to 
communicate to students and parents their impressions concerning the 
student's work habits and attitudes, his sportsmanship on the playground, 
and other aspects of his personal-social development. Teachers’ ratings 
on these characteristics are ordinarily included on elementary school 
report cards. 

If we wish to study the social acceptance of individual students, socio- 
metric techniques can prove valuable. A summary of these data will reveal 
which students are unchosen by others; these students might not have been 
identified by observation, for they may be “hangers-on” or “fringers” in 
social groups. Sociometric techniques, rating scales, and other means of 
obtaining the opinions of others are considered in this chapter. 
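
The summary of sociometric data mentioned above can be illustrated with a minimal sketch. It assumes that each student has been asked to name the classmates with whom he would most like to work; the names, the two-choice format, and the function names `tally_choices` and `unchosen` are hypothetical and are introduced here only for illustration.

```python
from collections import Counter


def tally_choices(choices):
    """Count how many times each student is named by classmates.

    `choices` maps each student to the list of classmates he or she chose.
    Self-choices are ignored.
    """
    received = Counter({student: 0 for student in choices})
    for chooser, chosen_list in choices.items():
        for chosen in chosen_list:
            if chosen != chooser:
                received[chosen] += 1
    return received


def unchosen(choices):
    """Return, in alphabetical order, the students named by no one."""
    received = tally_choices(choices)
    return sorted(student for student in choices if received[student] == 0)


# Hypothetical data: four students, each naming two classmates.
choices = {
    "Mary": ["Susan", "Peter"],
    "Peter": ["Roy", "Susan"],
    "Roy": ["Peter", "Susan"],
    "Susan": ["Peter", "Roy"],
}

received = tally_choices(choices)
isolates = unchosen(choices)
```

A tally of this kind only identifies students who received no choices; it says nothing about why they were passed over, so it is best treated as a lead for further observation.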


SELF-REPORT TECHNIQUES 


Self-report inventories have been deferred to Chapter 9 because they should 
be used only by persons with special training in their interpretation. Other 
self-report techniques that can be used effectively by teachers include: 
(1) interviews and (2) autobiographical materials. 


Interviews 


Whenever we plan to use interview data as a basis for ranking students 
or employees, we become concerned about obtaining comparable data. 
That is, when interviews are used as part of a selection process (for 
example, with applicants for scholarships, college admission, or employment), 
it is essential to agree upon the information needed, the characteristics 
on which ratings are to be made, and probably to compile a list of 
questions that should be used. 

In many situations, however, we are not attempting to rank individuals. 
We may want information that will aid in diagnosis and in forming hypotheses 
about desirable next steps in helping the individual. Hence, the focus 
of the interview will change from student to student. The teacher or 
counselor may be especially concerned about Mary's difficulties in gaining 
acceptance in a social group, with Peter's inability to accept criticism, with 
Roy's tendency to close his mind to new ideas, or with Susan's feelings 
of conflict between her desire to go far in a chosen career and her desire 
to be “one of the gang” among her peers. 

If the teacher believes that the student's problems originate in his family 
relationships, he will skillfully lead the conversation to home responsibilities; 
leisure activities the student enjoys most and his companions in 
such activities; activities participated in with mother, father, brothers, 
and sisters; how the student usually spends his weekends; whether he 
has a favorite brother or sister; and the like. These questions should 
stimulate conversation that reveals the student's underlying feelings about 
his family relationships. 

The interview is valuable in working with almost any type of problem. 
The nature of the interview, its length, and the number of interviews 
necessary must be individualized to fit the student, his maturity level, 
and his problem. 


PREPARING FOR THE INTERVIEW The teacher or counselor should pre- 
pare for an interview by reviewing available data and organizing his own 
knowledge about the student. The cumulative-record folder should be 
checked, especially for comments by previous teachers, relevant test data, 
and home information. It is highly important, for example, to know 
whether there is a stepparent, whether other relatives live in the home, 
the number and age of siblings, and other facts that will help to direct 
the teacher's questions and also increase his awareness of topics on which 
the student might be unduly sensitive. 

If the teacher has prepared for the interview by organizing information 
about the student's behavior and possible environmental pressures, he will 
probably have made certain hypotheses about the “why” of any disturbing 
behavior. Such hypotheses can be of great value in giving direction to the 
interview and helping the teacher to elicit information that supports or 
negates his tentative diagnosis. If the teacher is to use his time to greatest 
advantage, he should make use of such hypotheses. However, he must be 
alert to other causal factors that he may not have considered, and he must 
avoid the danger of seizing upon a single explanation and marshalling 
evidence to support it. 


ADEQUATE RAPPORT AND COMMUNICATION The importance of achiev- 
ing rapport with the student cannot be overemphasized. The teacher will 
already have laid the foundation for such rapport in his day-by-day rela- 
tionships with the student. In order for the atmosphere to be friendly and 
relaxed, many elementary school teachers invite children to help them after 
school with some interesting activity. 

The counselor may be meeting the student individually for the first 
time. However, he will have had opportunities (through group meetings) 
to communicate to students that he is a person who will listen to their 
problems and will help them make their own decisions. On the basis of 
earlier group contacts that have helped him to develop an appropriate 
image of his role, the counselor can now communicate his interest in the 
counselee as an individual. 


The effectiveness of the teacher's or counselor's communication to the 
student depends not only on his use of a simple vocabulary but on his 
using concrete illustrations instead of generalities. Moreover, it is impera- 
tive that the teacher or counselor avoid blocking communication by putting 
the student on the defensive through moralizing or cross-examination, or 
by seizing the initiative and dominating the interview situation. 

Valid and helpful information is more readily obtained by being an 
attentive and sympathetic listener than by doing most of the talking. As 
one gains in experience as an interviewer, one learns to avoid the type 
of question that can be answered by "yes" or “no” and to substitute ques- 
tions that stimulate the student to describe significant aspects of his environ- 
ment and the way he feels about them. 


Interpreting Information Obtained through Interviews 


Because of the subjectivity of the interview technique, it is important 
that information obtained through this means be interpreted in the light of 
more objective data from other sources. Students sometimes give informa- 
tion that is distorted or actually false. The student who needs help most 
may be the least willing to confide; moreover, he may have little insight 
into the reasons for his own behavior. 

Although it is essential to distinguish fact from fiction, the teacher 
must realize that information on the student's feelings and his distorted 
perceptions is highly important. A student's feeling that his mother is dis- 
appointed in him or that his home is old-fashioned is a significant part 
of the information needed to help him, even though his perceptions do not 
agree with the facts as seen by a more objective observer. 

Although the teacher must avoid cross-examining the student about his 
family relationships, he can usually obtain highly significant information 
through the student's informal conversation and through his replies to 
seemingly casual questions concerning home duties and routines, family 
recreation, the ages of his brothers and sisters, activities they share to- 
gether, and the like. Such information should be sought not out of curiosity 
but as a basis for understanding the adjustment problems of the student. 
It is generally recognized, for example, that a student's attitude toward 
adults in authority, toward imposed tasks, and toward school is con- 
ditioned to a large extent by his attitude toward parental authority. It is 
also recognized that a student's reaction to his classmates may represent 
a displacement of his feelings toward his brothers and sisters. 

At the risk of oversimplification, an attempt has been made in Table 
8.1 to summarize the “do's and don't's” of interviews concerned with 
students' adjustment problems. 




Table 8.1 
Pointers on Interviews with Students Concerning Adjustment Problems 


DON'T 

1. Confuse the real purpose of the interview with the immediate problem 
that precipitated it. 

2. Talk too much or dominate the interview situation. 

3. Cross-examine the student. 

4. Seem rushed or preoccupied. 

5. Moralize or pass judgment on a student's behavior in such a way 
as to lower his self-respect. 

6. Antagonize the student or put him on the defensive. 

7. Prod the student into revealing confidential information about his 
friends or his family that he may later regret having told you. 

8. Create a feeling of dependence on the part of the student. 

9. Try to accomplish too much in one interview. 

10. Judge the value of the interview entirely on the basis of results 
accomplished at that time. Even though little specific progress seems to have 
been made, a good interview may have laid the basis for closer 
teacher-student relationships and later confidences. 


DO 

1. Prepare for the interview by obtaining and organizing pertinent 
information. 

2. Help the student to relax through an informal greeting, participation 
in some after-school activity, and/or discussion of pleasant topics of 
especial interest to him. 

3. Put yourself in the student's place; try to see things through his eyes. 

4. Be an attentive listener, watching for leads suggested by the student's 
conversation. 

5. Talk on the student's level, maintaining an attitude of cooperation rather 
than a display of authority. 

6. Shift as much responsibility to the student as he is able to handle, 
helping him to think through "next steps." 

7. End the interview with a forward-looking attitude, leaving the student 
with less anxiety and greater self-confidence. 

8. Make special note of the student's last comments as he leaves the 
interview situation; he may touch on problems that he wished to discuss 
but could not find the courage to propose earlier. 


After the interview, the teacher or counselor should record the salient 
points on a simple “Record of Interview” form for filing in the student's 
folder. 


Autobiographical Material 


An autobiography may be brief and stereotyped or a highly revealing 
and helpful document. In order to avoid a routine chronology or listing of 
facts, the teacher should discuss the writing of a good autobiography with 
the class. He should read excerpts from autobiographies that involve the 
writer's problems, worries, feelings, and attitudes. Then he should suggest 
the kinds of subjects that students may wish to include in their autobiographies, 
such as places in which they have lived, their parents’ wishes 
and concerns for them, their closest friends, their changing vocational 
plans, their worries, and other significant topics. Students should be assured 
that their autobiographies will be read only by the teacher and discussed 
with no one. If autobiographies are to be filed in the students’ folders, 
students should know that they will be used only by their own counselors. 

The writer of an autobiography may choose to omit, to overemphasize, 
or to distort any aspect of his life. This very freedom of expression, how- 
ever, makes the report a revealing one to the person who has the necessary 
training and experience for its interpretation. An incident reported in 
unusual detail is usually of great significance to the writer. Omission of a 
period in childhood may indicate merely that things went well during that 
period; however, the omission of reference to one parent or one sibling 
may be indicative of a strained relationship. 

Some autobiographies (in fact, most autobiographies of children below 
age 14) are a compilation of commonplace facts and of incidents that 
appear to have no special significance.⁷ However, when an older adolescent 
writes an autobiography of this type, it may be an indication that he 
is (1) a highly defensive individual who hides his problems, (2) a person 
with shallowness of feeling, or (3) a student who was not "sold" on the 
significance of the assignment or who distrusted the use that would be 
made of the material. 

In reading an autobiography, a teacher or guidance worker should be 
sensitive to variations in tone or mood. In speaking, a person reveals 
through his voice, gestures, and facial expression the extent of his con- 
cern with his subject; similarly, he may show in his writing (for example, 
through the use of emotionally charged words) that certain experiences 
have touched him deeply. 

A teacher or counselor who has considerable factual data about a 
student may be able, in reading his autobiography, to detect misrepresentations 
of facts or evidence of self-deception. A student, for example, 
may have avoided facing, or may have distorted, facts about his low 
achievement, lack of social acceptance, or failure in job situations. Such 
distortions usually indicate areas in which the student feels especially 
vulnerable. 

Since adequate interpretation of autobiographical material involves great 
sensitivity and skill on the part of the interpreter, the teacher or counselor 
working with such materials should increase his insights into this and 
other child-study techniques through participation in case conferences 
and other in-service education experiences, and through study of references 
that discuss the autobiography in greater detail than is possible in general 
textbooks on measurement and evaluation. 


7 Gordon W. Allport, The Use of Personal Documents in Psychological Science 
(Washington, D.C.: Social Science Research Council, 1942), p. 80. 


OBSERVATION OF BEHAVIOR 


Self-report techniques can, under ideal circumstances, provide exceedingly 
helpful information obtainable in no other way. It would be unwise, how- 
ever, to limit our study of an individual to the information he voluntarily 
reports. We should capitalize on our opportunities to observe his actual 
behavior in a variety of situations. The observation of actual behavior, 
and techniques of improving its validity and reliability, will be considered 
in this chapter section. 


Systematic Observation of Behavior 


In research studies, systematic observations are often made of a single 
type of behavior as it occurs in natural situations. A time-sampling plan 
is typically developed in which each subject is observed in a random 
sampling of situations. For example, in studying aggressive behavior in 
nursery school children, we would have to define carefully which behaviors 
were to be classified as "aggressive." We would have to decide whether 
“hitting a child and taking his toy" was one or two incidents of aggres- 
sive behavior. We might plan to observe each child for several five-minute 
periods randomly scattered throughout a nursery school session. We would 
train observers and check on the degree of agreement between observers 
in their counting of aggressive behaviors (for the same children during 
the same time interval). In other words, we would do everything we could 
to increase objectivity and avoid bias, with respect to times observed, 
interpretation of terms, and many other factors. If we thought that the 
presence of observers would modify the children's typical behavior, we 
might observe the children through a one-way-vision screen. 
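
The time-sampling plan and the check on observer agreement described above can be sketched in a few lines. This is only an illustration under stated assumptions: the 180-minute session, the four five-minute periods, the tallies for the two observers, and the function names are all hypothetical, and simple percent agreement is used here as one possible index of agreement, not as the procedure any particular study followed.

```python
import random


def sample_observation_starts(session_minutes, period_minutes, n_periods, rng):
    """Choose non-overlapping observation periods at random within a session.

    Returns the starting minute of each period, in chronological order.
    """
    # Candidate start times are spaced one full period apart, so no two overlap.
    slots = list(range(0, session_minutes - period_minutes + 1, period_minutes))
    return sorted(rng.sample(slots, n_periods))


def percent_agreement(counts_a, counts_b):
    """Proportion of periods in which two observers recorded the same count."""
    if len(counts_a) != len(counts_b) or not counts_a:
        raise ValueError("observers must rate the same non-empty set of periods")
    matches = sum(1 for a, b in zip(counts_a, counts_b) if a == b)
    return matches / len(counts_a)


# Hypothetical plan: four five-minute periods in a three-hour session.
rng = random.Random(0)
starts = sample_observation_starts(
    session_minutes=180, period_minutes=5, n_periods=4, rng=rng
)

# Hypothetical tallies of aggressive acts by one child, as counted by two
# observers watching the same child during the same four periods.
observer_a = [2, 0, 1, 3]
observer_b = [2, 0, 2, 3]
agreement = percent_agreement(observer_a, observer_b)
```

Low agreement between observers would suggest that the definition of the behavior being counted needs to be tightened before the counts are interpreted.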


Informal Observation of Behavior in Natural Situations 


The teacher is not willing to restrict his range of observation as nar- 
rowly as is the researcher. The teacher is interested in any behavior that 
helps him to understand the child better and gives him leads as to how the 
child can be motivated and guided. 


SITUATIONS AFFORDING OPPORTUNITIES FOR OBSERVATION Teachers 
are able to observe children in a wide variety of situations, to observe the 
roles a child plays in different social groups, and to note variations in his 
behavior from one situation to another. Instead of formally setting aside 
observation periods, the teacher notes significant incidents whenever they 
occur. These incidents may occur in various kinds of classroom activities, 
in the cafeteria, on the playground, and in extracurricular activities. 
Certain activities, however, are especially likely to provide clues with 
respect to the student's personal-social adjustment. 


1. The informal discussion period (sometimes called the "show and tell" 
period) that characterizes many elementary school classrooms is an excellent 
time for observing child behavior. Note a child's willingness or reluctance 
to volunteer, his tendencies to exaggerate or even to fictionize, his need to 
compete with other children in telling about "bigger and better" exploits. 
During these discussions, children often reveal valuable information about 
relationships with parents or siblings, typical leisure activities, home 
responsibilities, and the like. 

2. When students are working on arithmetic practice materials or other 
“self-directed activities,” the teacher can observe their attitudes toward an 
imposed task, their persistence or distractibility, their dependence on other 
children or adults, their confidence in their own judgment in activities in 
which there is no set procedure (for example, word problems or laboratory 
exercises), and the like. 

3. When students are working with others on a group activity, the teacher has 
opportunities to observe the degree to which each student seems able to 
show cooperative behavior, the extent to which he dominates the group and 
the techniques he uses in doing so, his reactions when his suggestions to the 
group are rejected, and the like.⁸ When students are working on committee 
activities under the leadership of a classmate, the teacher can observe which 
students seem unable to take the initiative, follow through on assigned 
responsibilities, or show good work attitudes unless they are working under 
the close supervision of adults. 

4. Any period in which students discuss controversial issues is an excellent time 
for observing student behavior. A student's willingness or reluctance to take 
a stand, his tendency to defend his own ideas just because they are his, or 
to accept or reject ideas because he likes or dislikes the individuals proposing 
them, his need to compete with other students and to win in any discussion: 
all are significant clues to his adjustment. 

5. When a teacher sponsors a club or other extracurricular activity, he has the 
opportunity to observe students in their spontaneous social groupings and 
to obtain more valid data concerning those who are popular or socially 
isolated than he can obtain in the classroom. Students who seem at ease in 
the relatively formal atmosphere of the classroom may show shy, withdrawing 
behavior or aggressive, exhibitionistic behavior when participating in 
informal social activities with their peers. 

6. Students’ reactions in role playing and other types of dramatization are 
often revealing. The characters they choose to play, or those assigned to 
them by their peers, may be significant. As students participate in such 
impromptu dramatizations, the teacher can note which students take the 
initiative, which seem to be too tense and inhibited to participate in role 
playing, and which are characteristically assigned to background roles. 
Children whose tenseness makes them withdraw from creative self-expression 
may be able to participate in dramatizations using hand puppets or shadow 
play.⁹ 

7. Observing a student's behavior in creative art activities may help the teacher 
to achieve greater understanding of him. Using art materials that demand 
less skill opens the channels of creative expression to a larger number of 
students than would otherwise be possible. Finger painting is considered 
especially valuable as a technique requiring little skill, permitting great 
flexibility, and stimulating freedom of expression. Observation of children 
at work and their voluntary interpretations of their art productions may 
contribute to greater teacher understanding of the child. 

8. Observation of children’s behavior on the elementary school playground is 
an essential part of any study of their personal-social adjustment. When 
playground activities are not directed by adults (as during the lunch period 
or the period before school opens), highly competitive social situations may 
exist in which children show primitive types of aggression, some children 
are unmercifully teased or are denied use of equipment, and timid children 
seek refuge from the overwhelming energy and aggressiveness of their more 
active classmates. In directed play situations at the elementary level, and in 
physical education activities at the high school level, the teacher has the 
opportunity to note energy level, physical coordination, proficiency in the 
physical skills so important for social acceptance, self-assurance, and 
leadership ability. 


8 Gertrude Driscoll, How to Study the Behavior of Children (New York: Bureau 
of Publications, Teachers College, Columbia University, 1941), p. 15. 


LEARNING TO DESCRIBE, RATHER THAN JUDGE Making a judgment 
about a pupil’s laziness, cooperativeness, or some other characteristic is 
not justified unless the observer has followed the procedures used in 
research studies in defining the characteristic being studied and in obtain- 
ing large, representative samplings of behavior. As a rule, however, 
teachers do not need to appraise each student with respect to his status on 
certain personality dimensions. They are more interested in formulating 
better hypotheses about why a child behaves as he does. 

In the formulation of such hypotheses, descriptions of the ways in which 
a child reacts to specific situations are most helpful. Instead of labeling a 
student as “cooperative” or “uncooperative,” “interested” or “uninter- 
ested,” the teacher should note the situations in which the student was 
cooperative or uncooperative or the activities in which he showed the 
greatest or the least interest. 


RECORDING OBSERVATIONAL DATA Casual, unrecorded observations are
interesting but may be misleading as a basis for diagnosing a student's
needs. More helpful diagnostic leads will be obtained through the use of
a behavior journal, in which repeated observations are recorded so as to
provide a sampling of the student's behavior on his good days as well
as his bad days. These observations should be made during class discussions
as well as during supervised study; in student-directed as well as
teacher-directed activity. Insofar as the teacher's opportunities permit, they
should be made in the cafeteria, on excursions, and in extracurricular
activities, as well as in the classroom.

9 A. G. Woltmann, “The Use of Puppets in Understanding Children,” Mental
Hygiene, vol. 24 (July 1940), pp. 445-458.

The most valuable observational records are descriptions of significant 
incidents in the life of the student. These descriptions are frequently 
termed anecdotal records. In the anecdotal record, the teacher describes 
an incident, setting forth briefly and objectively the actual happenings, the 
setting of the incident, and (if desired) his own interpretation of the sig- 
nificance of the behavior. The teacher's interpretation should be separate 
from the description of the incident and should always indicate whether 
or not the behavior is typical of the student. The date and time of day
should always be indicated, as well as the setting of the incident and the
type of group activity.

The following anecdotal record for a seventh-grade boy is cited as illus- 
trative of the characteristics of such records. 


February 25 


SETTING During the noon period when the weather is cold, the pupils usually
spend their time in the gymnasium playing games.

THE INCIDENT David frequently remains at school although he lives only a
short distance from school. Today, as usual, he entered none of the games. He
stayed close to the stage, jumped off it several times and rolled around on it.
(The stage is at one end of the gym.) He made no effort to join in any of the
games. When asked why he did not play with some of the other boys, he replied,
“They don’t want me.”

INTERPRETATION Because of his small stature and his physical condition,
David cannot compete with the boys of his own age. The group to which he
would like to belong has not accepted him.¹⁰
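
A teacher who keeps such records in electronic form might mirror the parts of the anecdotal record in a small data structure. The sketch below is only one possible layout; the class name, the field names, and the completeness check are illustrative assumptions, not part of any published record form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnecdotalRecord:
    """One described incident, kept separate from its interpretation."""
    student: str
    date: str                 # e.g. "February 25"
    time_of_day: str          # e.g. "noon period"
    setting: str
    group_activity: str
    incident: str             # brief, objective description of what happened
    interpretation: Optional[str] = None        # optional, kept apart from the incident
    typical_of_student: Optional[bool] = None   # required once an interpretation is given

    def is_complete(self) -> bool:
        # An interpretation should always say whether the behavior is typical.
        if self.interpretation is None:
            return True
        return self.typical_of_student is not None


david = AnecdotalRecord(
    student="David",
    date="February 25",
    time_of_day="noon period",
    setting="Gymnasium, where pupils play games in cold weather",
    group_activity="Informal games",
    incident="Entered none of the games; stayed close to the stage. "
             "Asked why he did not play, he replied, 'They don't want me.'",
    interpretation="The group to which he would like to belong has not accepted him.",
    typical_of_student=True,  # the incident is described as happening "as usual"
)
```

Keeping the interpretation in its own optional field reflects the recommendation that description and interpretation be recorded separately.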


The teacher who is beginning to write anecdotal records will find the
following suggestions practical:

1. Start by selecting one or two students for intensive study.

2. Describe as many significant incidents each week as possible.

3. Do not try to interpret every incident. Make a summary analysis at
convenient periods and look for developmental trends in behavior.

4. Concentrate on describing those types of behavior which you believe to have
a bearing on the student's difficulties.

10 Theodore L. Torgerson, Studying Children (New York: Holt, Rinehart and
Winston, Inc., 1947), pp. 88-89.


As a teacher becomes sensitive to the many symptoms students show in 
their behavior every hour of the school day, his problem becomes one of 
selecting those that are the most significant and that justify the time spent 
in recording. The focus of the teacher's observation will vary from student 
to student according to the individual's major problems and the teacher's 
hypotheses concerning possible reasons for his difficulties. 

Teachers who systematically record their observations of student be- 
havior develop invaluable records that not only help them in understanding 
individuals but serve as a basis for conferences with parents and with pro- 
fessional staff members as well. 


Observation of Behavior in Specially Devised Situations 


When we observe student performance in the skills as a basis for evaluation,
we attempt to set up standard situations so that students will be
performing under comparable conditions. In basketball, for example, we 
specify where the student shall be standing when he shoots baskets; in 
hurdle racing, we specify how high the hurdles shall be and how far apart. 
When personnel officers in industry and the armed services attempted to 
appraise different aspects of personal-social adjustment, they also tried to 
standardize stimulus situations. 


SITUATIONAL TESTS One type of specially devised situation that has
proved especially promising is the leaderless group discussion. A small
group of applicants for teaching positions, for example, might be assigned
an interesting and fairly controversial topic, such as the advantages and
limitations of using television or team teaching. No one is assigned
responsibility as moderator. During the course of the discussion, data are recorded
concerning the number and nature of each individual's contributions, the
leadership pattern that develops, each person's reactions to disagreement
with his point of view, and the like. It is obvious that this technique can
provide as much information on a number of variables as could be obtained
in informal observations of each person over a comparatively long period
of time.

Various other “situational tests” involving teamwork in the solving of a
difficult problem, reactions to continued frustration and criticism from
co-workers, and the like have been used in research studies, and as one basis
for selection and classification of employees. Essentially, a situational test
places the subject (or subjects) in a situation simulating the real-life
situation in which we would like to observe his behavior. It is similar in many
respects to the work sample test, discussed in Chapters 5 and 12; however,
the criterion behavior we are attempting to sample is even more complex
and more difficult to interpret than that studied in evaluating performance
in the skills.


SPECIALLY DEVISED SITUATIONS IN THE CLASSROOM The use of situational
tests as a basis for ranking individuals on personality dimensions
requires careful planning and is very expensive. Teachers, however, can
utilize specially devised situations in eliciting behavior of a type they 
especially wish to observe. For example, a teacher can assign students to 
work together on some fairly difficult project; he can assign individuals 
to various responsibilities on a classroom newspaper or other classroom 
project. Or he can use an unfinished film or “reaction story” to stimulate 
students to role play alternative endings to the story. 

If reaction stories are used, they should be carefully selected to meet 
several criteria. A reaction story should lend itself to good oral reading, 
being sufficiently dramatic to hold the interest of the students and yet 
realistic enough to be plausible; it should involve characters with whom 
students can easily identify themselves; and it should present a genuine
problem or issue, to which students of this age would offer and defend a
variety of solutions.¹¹ The following résumé of a reaction story indicates
that it would meet these criteria:


THE BLIND FISH

Developmental task: respect for authority. Children growing up in a society
have to learn to curb many of their personal desires and impulses because of
the regulations necessary to successful group life.

The basic situation of this story is that of a boy who does not obey rules set
down for the welfare of his whole group. A dozen boys at camp go out on a
hike with a counselor. They explore a big cave. The counselor lays down some
rules for their safety: they must stick together and follow him closely, for if
anyone gets separated from the group in this underground maze, he might be
lost for days. There is danger, too, of falling off ledges and sliding into pits.
The hikers pass an underground pool. One boy induces another to lag behind
with him. They wade into the water and their lights disclose fish in the clear
pond. They catch some and discover that the fish are blind, and grow excited
over their unique find. Then one boy steps into a deep hole and goes under
water. He cannot swim and would have drowned if the other boy had not been
along to save him. Even the good swimmer, however, gets a cramp in the cold
water and barely manages to reach the bank. The crowd returns. Hearing about
the blind fish, they all want to catch some. The counselor refuses to permit it.
They clamor that he's not fair; the first two boys had caught some of the fish.
So why can't they all?¹²

11 George Shaftel and Fannie R. Shaftel, Role Playing the Problem Story (New
York: National Conference of Christians and Jews, 1952).


The teacher may prefer to have individual students write out their story 
endings and hand them in. This procedure has the advantage of obtaining 
reactions from all students; moreover, the reactions of individuals are not 
influenced by the positions taken by class leaders. Written statements that 
appear to be especially revealing can be filed for further reference. Written 
answers, however, may involve less spontaneity; and for students who do 
not write fluently, the reactions may be brief and conventional.¹³


OBTAINING THE OPINIONS OF OTHERS:
TEACHER RATING SCALES


We have already considered two major sources of information about the 
individual's personal-social adjustment: (1) self-report through interviews 
and biographical materials and (2) observation of behavior (in natural 
and in specially devised situations). The methods we will consider in the 
remainder of this chapter are based upon the opinions of others. These 
opinions are most valid when they are based on adequate observation. That 
is, teachers' ratings and peer ratings are valid and reliable only to the 
extent that they are based on large and representative samplings of ob- 
servations. Ratings made by teachers and peers on students they do not 
know well or on characteristics that are not evidenced in observable be- 
havior tend to have low reliability and validity. 


12 Ibid., pp. 41-42. 

13 Although they may lack some of the stimulus value of the reaction story, certain 
story titles may be assigned that will stimulate students to indulge in fantasy. For 
example, the teacher might assign the title "If I . . .," which the student may com- 
plete in any way he wishes, for example, "If I Were a King," or “If I Had a Million 
Dollars." Other stimulating titles might be: "When I'm Through School," "My Day- 
dreams," “If I Could Have Three Wishes," "If I Could Have My Way,” and the 
like. Further suggestions concerning theme or story topics that stimulate self-expres- 
sion, as well as the use of open-ended questions and incomplete sentences, are given 
in many textbooks on guidance, as well as in Hilda Taba and others, Diagnosing 
Human Relations Needs: Studies in Intergroup Relationships (Washington, D.C.: 
American Council on Education, 1951). 




Observation as a Basis for Evaluative Judgments or Ratings 


Teachers are often asked to record judgments concerning student be- 
havior on a rating scale of behavior traits. In fact, the report card is 
actually a rating scale in which numbers or letters are used as a method 
of communicating teachers’ judgments to students and parents. Obviously, 
ratings on responsibility, sportsmanship, and other characteristics are of 
little value unless they are based on careful, unbiased observation of stu- 
dent behavior. 

If classes are too large and daily schedules too crowded for the teacher
to cumulate a large number of anecdotal records for each student, a realistic
compromise is presented by such a record sheet as that developed for
the Personal and Social Development Program.¹⁴ This record form is organized
to focus teacher attention on behavioral evidence of four “personal
traits” and four “social traits” of significance in pupils' personal-social
adjustment at school. They are as follows:


PERSONAL TRAITS                  SOCIAL TRAITS

1. Personal adjustment           5. Social adjustment
2. Responsibility and effort     6. Sensitivity to others
3. Creativity and initiative     7. Group orientation
4. Integrity                     8. Adaptability to rules and conventions


In order to help teachers to define each of the "traits," several positive 
and negative examples of behaviors classifiable under that trait are in- 
cluded. When a teacher makes a brief dated entry of a "critical incident," 
he notes the code number of the item. Examples for the trait “sensitivity to 
others" are cited as illustrative: 


Sensitivity to Others

BEHAVIORS TO BE ENCOURAGED

1. Saw that others were not left out
2. Cheered up, complimented, or encouraged others
3. Was kind to someone with handicap or special problem
4. Tactfully provided something for needy child
5. Did something for person not feeling well
6. Corrected or made suggestions to another in tactful manner
7. Interceded for or stuck up for another

BEHAVIORS NEEDING IMPROVEMENT

1. Left another child out of activity
2. Referred to another's race, religion, or nationality in a disparaging manner
3. Called another child names
4. Made fun of or teased another about handicap
5. Laughed at the mistakes of others
6. Used sarcasm and disparaging remarks in making criticisms and suggestions
to or about others

14 John C. Flanagan, Personal and Social Development Program (Chicago: Science
Research Associates, Inc., 1956).


Ideally, a teacher should record many descriptive nonjudgmental observa- 
tions in a behavior journal before making judgments about student be- 
havior. Then when judgmental ratings are needed as a means of shorthand 
communication to students and parents, he can review his cumulated 
anecdotal records and arrive at summary judgments. 


Designing Rating Scales 


The first step in the development of a rating scale is the selection and 
definition of the traits to be rated. The traits listed on a report card or 
rating scale should be: 


1. Independent, in the sense that overlapping should be avoided (for example,
responsibility and reliability should not both be listed, for they are too closely
related);

2. Definable in terms of observable student behavior (as in the examples given 
above from the Personal and Social Development Program); 

3. Related to the major goals of the school program; 

4. Reasonably homogeneous or unitary, that is, the trait does not involve com- 
ponent behaviors that have low intercorrelations. 


An example of a trait that violates the last criterion is one included in a
school-district rating scale under the heading “He does his written
assignments.” The trait description for the highest level reads: “Always
completes written assignments on time; has them well organized and in good
form.” The three components included (punctuality, organization, and
form) would not appear to be highly correlated with each other; hence 
difficulties in the rating process and ambiguity in communication result. 

On most rating scales, the rater's judgment of the student is indicated
in one of three ways: (1) by a numerical rating, (2) by a check at any
point on a line that best describes the degree of the trait possessed by the
student (graphic rating scale), or (3) by a descriptive term that most
closely describes him.

Scales requiring a simple numerical rating are used in Chapter 12. 
Graphic rating scales permit the assignment of any intermediate rating, 
for example: 

Low ________________________________________ High


Neither the numerical nor the graphic type, however, helps to define degrees
of the trait being measured.


1. From the American Council on Education rating scale for prospective
college students

Does he get others to do what he wishes?

   Scale positions, left to right:
   Probably unable to lead his fellows
   Lets others take lead
   Sometimes leads in minor affairs
   Sometimes leads in important affairs
   Displays marked ability to lead his fellows; makes things go

2. From a “Rating Scale for Evaluating Work Habits and Skills” (Long
Beach, California, City Schools)

He follows instructions

   Scale points: A B C D E
   Descriptions in order along the scale:
   Attends to written and oral instructions; follows directions accurately
   Sometimes lets his attention wander; usually follows directions
   Pays little attention to instructions; follows directions reluctantly
   or not at all

3. From the Haggerty-Olson-Wickman Behavior Rating Schedules

How does he accept authority?

   Scale positions, left to right:
   Respectful, complies by habit
   Entirely resigned, accepts all authority
   Ordinarily obedient
   Critical of authority
   Defiant

4. From a rating scale for nurses used by the University of Michigan
School for Nursing

ADJUSTMENT TO SITUATIONS

   Scale positions, left to right:
   Sometimes at a loss in familiar situations
   Slow to adapt to new situations
   Learns new arrangements fairly soon
   Quick to adjust to new routine
   Very quick to respond to emergencies

Fig. 8.1 Illustrative Items from a Variety of Rating Scales


Descriptive rating scales are more difficult to construct but have greater 
communication value. If the trait descriptions are carefully worded so that 
the terms on the negative end of the scale do not sound too damaging, 
they may also help to distribute ratings more widely over the scale. In 
Figure 8.1 are presented illustrative items from several descriptive rating 
scales. In each of these the traits are reasonably homogeneous and are 
described in behavioral terms. 

Since it is difficult for a teacher to make evaluative judgments on all 
traits for all students, it is advisable for the rating scale to include such a 
category as “instructor uncertain,” or “no opportunity to observe.” Inclusion
of such an option helps to increase the accuracy of teacher ratings.
Space for comments on illustrative behavior or for other supporting evi- 
dence is another desirable feature of some rating scales. 


Improving the Rating Process 


In rating, as in all other evaluation techniques, the teacher is concerned 
not only with the quality of the instrument used but with the validity and 
reliability of the actual scores or ratings obtained. Hence, the process of 
assigning ratings is important. In using any rating scale (for example, the 
"personality and character" section of the report card), the teacher should 
rate all students on a single trait, preferably by sorting the cards into 
groups representing “high,” “average,” and “low” or whatever rating cate- 
gories are being used. After the sorting has been completed, the teacher 
can reexamine his judgments concerning the students assigned to each cate- 
gory to see whether any students should be reassigned to higher or lower 
ratings. Ratings are then entered for the trait, and the same procedure is 
repeated for each trait of the rating scale. 

This process is suggested to minimize “halo effect"—that is, the effect 
of the teacher's general or over-all impression of a student on his rating 
of the specific traits. If a student rates unusually high or low on personality 
traits that are important to the teacher, the teacher's ratings on his 
other traits are likely to be affected positively or negatively by this “gen- 
eral impression.” The sorting procedure just described will help to mini- 
mize this source of error. Also, as the teacher sorts cards, he will note 
the relative number of students he has assigned to each category and 
remember that approximately as many students should be assigned below- 
average as above-average ratings. Thus he can minimize the generosity 
error, or the bias toward high ratings evident in the results of most rating 
procedures.¹⁵
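
The percentages suggested in the accompanying footnote for three-, four-, and five-step scales (20-60-20; 11-39-39-11; 7-24-38-24-7) can be reproduced by cutting the base line of the normal curve into equal segments. The sketch below assumes that the base line runs from 2.5 standard deviations below the mean to 2.5 above it, with the two end categories absorbing the tails; that range is an inference from the quoted figures, not something the text states, and the function names are illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Proportion of a standard normal distribution lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_percentages(steps: int, half_range: float = 2.5) -> list[float]:
    """Percent of students expected in each of `steps` rating categories.

    The base line from -half_range to +half_range (in standard deviation
    units) is cut into equal segments; half_range = 2.5 is an assumption.
    """
    if steps < 2:
        raise ValueError("a rating scale needs at least two steps")
    width = 2 * half_range / steps
    cuts = [-half_range + i * width for i in range(1, steps)]
    cumulative = [0.0] + [normal_cdf(c) for c in cuts] + [1.0]
    return [100 * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]


for steps in (3, 4, 5, 6):
    print(steps, [round(p) for p in category_percentages(steps)])
```

Comparing the number of cards sorted into each category with these expected percentages gives the teacher a rough check on the generosity error.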


15 In the absence of knowledge to the contrary, it is best to assume that students
are normally distributed with respect to a given trait. Hence, if there are three steps
on a rating scale, approximately 20 percent should receive the highest rating, 60
percent the middle rating, and 20 percent the low rating. If the rating scale is a
four-step one, the percentages should be approximately 11 percent each in the lowest and
highest categories, and 39 percent in each of the middle categories. With a five-point
scale, the percentages are those so often cited for grading “on the curve,” that is,
7 percent, 24 percent, 38 percent, 24 percent, and 7 percent. If six or more steps are
used, the percentages corresponding to six or more equal divisions of the base line
of the normal curve are obtained in a similar way.


OBTAINING THE OPINIONS OF OTHERS:
SOCIOMETRIC TECHNIQUES


We have studied techniques concerned with how the student sees himself,
how he behaves in natural and specially devised situations, and how his
teachers record summary judgments on rating scales. The techniques in
this final section are concerned with how the student's classmates see him
and with his degree of social acceptance by them.

Although the teacher may gain considerable information about students’
social relationships through observation, results of class elections, and the
like, only sociometric techniques reveal how students would like to associate
and how their wishes compare with the attitudes of other students
toward them.


Selecting the Questions


The first step in making a sociometric study is to choose questions that
will stimulate students to reveal their true feelings about other members
of the class. Questions of the following types are appropriate:

1. Whom do you wish to sit next to in the classroom?

2. With whom would you like to work on a committee?

3. Whom would you like as companions on a class project?

4. Who are your best friends?

5. With whom do you like to associate after school?

Of the five questions listed above, the first three are ones in which the
choices can actually be put into effect. Students are likely to respond more
frankly if they feel that they will benefit from an honest report of their
feelings. Questions 4 and 5 provide information about close interpersonal
relationships but indicate no reason for seeking the information.

The questions also differ with respect to the basis for choice. The first
and fourth questions imply friendship and pleasure in proximity. The second
and third may bring in the additional elements of interest, skill, work
habits, and the like. The fifth question, although it involves personal
friendship, may be conditioned by such factors as living in the same
neighborhood, membership in clubs, and skill in athletics.

All the questions or criteria listed above involve positive reactions, or
attractions. Negative questions implying rejection have occasionally been 
used. Although their use in a research study can be justified, their use by 
teachers probably cannot. An observant teacher is usually aware of students 
who are actively rejected and need not focus class attention on them by 
asking for negative choices. Requesting, or even seeming to condone, nega- 
tive choices seems contradictory to the teacher's positive expectation that 
each student should be willing to work with, or sit by, any of his classmates. 

If one wishes to obtain reliable data concerning the relative social
acceptance of individuals, it is desirable to request five choices rather than a
smaller number.¹⁶ From the third grade on, children seem to experience
little difficulty in making five choices. Children in grades one and two are
usually able to make at least three choices for each criterion.¹⁷


Administering the Questions 


The sociometric test should be administered in an informal and natural 
manner. Ordinarily, the teacher distributes 4 in. by 6 in. slips of paper on 
which the student writes his name and lists his numbered choices. If more 
than one question is used, duplicated forms may be advisable. Although 
the teacher will wish to give the directions informally rather than read 
them, he should think through his presentation carefully and know what 
he wishes to say. Jennings has summarized the essential pointers for good 
administration as follows: 


Teachers should always feel free to answer any questions that may occur to 
the group, both before and during the writing, and should treat the occasion in 
a business-like manner. The most important things to remember about adminis- 
tering the test are: (1) to include the motivating elements in the introductory 
remarks, (2) to word the question so that children understand how the results 
are to be used, (3) to allow enough time, (4) to emphasize any boy or girl, 
so as to approve in advance any directions the choice may take, (5) to present 
the test situation with interest and some enthusiasm, (6) to say how soon the 
arrangements based on the test can be made, and (7) to keep the whole
procedure as casual as possible.¹⁸


16 Use of five choices has been found to result in the most stable sociometric data
according to research studies summarized in Norman E. Gronlund, Sociometry in
the Classroom (New York: Harper & Row, Publishers, Inc., 1959), p. 48.

17 Ibid., p. 48.

18 Helen Hall Jennings, Sociometry in Group Relations (Washington, D.C.:
American Council on Education, 1948), p. 16.




Teachers may find that providing a duplicated list of classmates’ names 
lessens confusion, discourages discussion, and reminds students of absen- 
tees. If desired, such a list may be numbered, with students recording the 
numbers, rather than the names, of classmates chosen. 


Constructing the Sociogram 


The general procedure in drawing a sociogram is to locate individuals 
on a chart so that the "stars" (the most frequently chosen) are near the 
center; the "isolates," or unchosen, are on the periphery; and other stu- 
dents are located so as to minimize the number of long lines and the number 
of intersecting lines. 

As a general principle, it is well to start the sociogram for either girls or 
boys by drawing in and labeling the symbols for those who are "stars" 
and for their mutual friends. Figure 8.3 is a sociogram depicting the boys’ 
choices, as summarized in the tabulation form (Figure 8.2). The first symbols
entered on this chart would be those for John A. and Harry E. (the
most frequently chosen boys) and their mutual friends, Raymond F. and 
Roger B. Symbols for additional boys chosen by this group are entered 
next (Fred D., George G., and Walter H.). The symbols are placed so as 
to minimize long and intersecting lines. Finally, symbols and arrows for all 
boys still uncharted are entered, the decision on placement in each case 
being based on choices given and received. Isolates should be located near 
the periphery of the chart and placed so that lines can be most easily drawn 
to their choices.

Figure 8.3 shows that John A. and Harry E. are “stars,” each receiving
six choices. John's choices of Harry and Raymond are mutual. Although
Raymond received only four choices (two of them from fringers in the
group), he occupies a position of social importance in the class as a mutual
friend of John A. and the first choice of Harry E. Andrew I. is the only 
isolate in the strict sense of the word; however, Peter C. received only one 
choice (from the isolate Andrew) and had none of his three choices re- 
ciprocated. Barry J. received only one choice (a third choice from Walter 
H.). He made only two choices, failing to use his full quota; neither of the 
choices he made was reciprocated. 

The boys of this class look for leadership to two boys who are mutual 
friends, John A. and Harry E. These two boys include as a close personal
friend Raymond F., whose only other choices, however, come from two 
of the less popular boys. All the boys in the class are tied, although some- 
what loosely, into one psychological network. However, there are few 
mutual choices; the boys seem to be held together as a social group chiefly 
through the mutual friendship of the two “stars.” 
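
The tabulation that precedes the drawing of a sociogram can be sketched in a few lines of code. The example below uses made-up names and choices, not the data of Figure 8.2, and the function name and the keys of the returned dictionary are illustrative assumptions. It counts choices received, notes first choices and mutual choices, lists unchosen students, and orders students from most to least frequently chosen, which is the order suggested above for entering symbols on the chart.

```python
from collections import Counter

# Hypothetical data: each chooser lists classmates in order of preference.
# A short list means the student did not use all of his choices.
choices = {
    "Ann": ["Bea", "Cal", "Dee"],
    "Bea": ["Ann", "Cal", "Eve"],
    "Cal": ["Bea", "Ann", "Dee"],
    "Dee": ["Bea", "Cal"],
    "Eve": ["Bea", "Ann", "Cal"],
    "Fay": ["Eve", "Bea", "Ann"],
}


def tabulate(choices):
    """Summarize sociometric choices for use in drawing a sociogram."""
    received = Counter({name: 0 for name in choices})
    first_choices = Counter({name: 0 for name in choices})
    for chooser, chosen in choices.items():
        for rank, name in enumerate(chosen, start=1):
            received[name] += 1
            if rank == 1:
                first_choices[name] += 1
    # A choice is mutual when each member of the pair names the other.
    mutual = sorted({
        tuple(sorted((a, b)))
        for a, picks in choices.items()
        for b in picks
        if a in choices.get(b, [])
    })
    isolates = sorted(name for name, count in received.items() if count == 0)
    # Most frequently chosen first, so they can be placed near the center.
    placement_order = sorted(received, key=lambda name: (-received[name], name))
    return {
        "received": dict(received),
        "first_choices": dict(first_choices),
        "mutual": mutual,
        "isolates": isolates,
        "placement_order": placement_order,
    }


summary = tabulate(choices)
print(summary["placement_order"])
print(summary["mutual"])
print(summary["isolates"])
```

The sketch only prepares the counts; placing the symbols so as to minimize long and intersecting lines remains a matter of judgment.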


Fig. 8.2 Sociometric Tabulation Showing the Choices of Ten Boys. (Rows:
chooser; columns: chosen; summary row: chosen as 1st choice. *Did not make
a third choice.)


Interpreting Sociometric Data for Individuals 


In interpreting sociometric data for individual students, the teacher is 
naturally most concerned about the isolates (those students who are not 
chosen) and the neglectees (those who are very infrequently chosen even 
when one requests five choices on two or more criteria).¹⁹ Several questions
about the isolates and neglectees need to be asked before the seriousness
of their isolation and the suitable remedial procedures can be
determined. 


19 If five choices are requested on a single question (or criterion) a pupil with
one choice is considered a “neglectee” because there are only two chances in 100
that he would receive so few choices if only chance were operating. Similarly, a
pupil with nine or more choices would be called a “star.” If five choices on two
criteria are used, a neglectee is one receiving four or fewer choices; a star, one
receiving sixteen or more. If five choices on three criteria are used, a neglectee is
one receiving nine or fewer choices, while a “star” is one who receives twenty-two
or more. Urie Bronfenbrenner and Henry Ricciuti, “The Appraisal of Personality
Characteristics in Children,” in Paul H. Mussen, ed., Handbook of Research Methods
in Child Development (New York: John Wiley and Sons, Inc., 1960), pp. 770-817.




Fig. 8.3 A Sociogram Based on the Data Shown in Figure 8.2 (Legend: circle,
boys; line marked 1, to first choice; line marked 2, to second choice; line
marked 3, to third choice; double line, mutual choice.)


1. Is the student who is isolated in class also isolated in home and neighborhood
groups, or is his position in one social group counterbalanced, in part, by
his position in another? (This phenomenon frequently occurs when the basis
for isolation is ethnic group or social class.)

2. Is the student relatively new to the class?

3. How old is the student? (If a student in the fifth grade or above is isolated,
his isolation is probably due to group attitudes toward him, while an isolated
child in kindergarten or the primary grades may simply be a shy child who
is ignored.)

4. Is the isolate realistic about his position? (Some isolates are quite realistic
about their social status in the class, indicating their desire to associate with
one or two students with whom there is some basis for establishing
friendship. Other isolates are quite unrealistic, listing the most popular members
of the group as friends.)


Among the isolates and neglectees, the teacher is almost certain to find 
students with adjustment problems that he might otherwise have over- 
looked. The teacher will probably find among the “stars” some happy, 
likable students who rank high in many aspects of personal-social adjust- 
ment. The fact that a student is a “star,” however, does not necessarily 
imply optimal personal-social adjustment. Among individuals who make 
persistent efforts to achieve popularity and leadership will be found some 
whose drive arises out of deprivations in other areas of adjustment—for 
example, affectional relationships in the family. These relationships empha- 
size again that data from several approaches or techniques must be com- 
bined if the teacher is to achieve an understanding of the student and his 
adjustment problems. 


Validity and Reliability of Sociometric Data 


If students are asked sociometric questions and promised that their 
choices will be considered in assignment to seats or to class committees, 
they are likely to give their actual preferences (as of that time and with 
respect to the questions used). In a sense, their choices are criterion data; 
and we need not be concerned with relevance.20 Our only concern appears
to be with reliability. 

Data concerning the stability of sociometric choices are encouraging. 
Gronlund21 found an average stability coefficient of .75 over a four-month
interval when he studied groups of children in grades 4-6; while Bonney,22
who has conducted several studies on stability of choices, found that sta- 
bility coefficients obtained over a one-year period ranged from .67 to .84. 
Requesting five choices for two or more questions will yield sociometric 
scores of considerable stability, if the study is made in the upper elementary 


20 If the sociometric questions are ones that do not imply fulfillment of choices,
such as “Who are your best friends?” we cannot take validity for granted. In one 
research study, however, Byrd found correlations of approximately .80 between (1) 
the number of choices each student received when classmates were asked to list 
those they preferred to work with on a class play and (2) the number of choices 
each student received when “real life” choices were made. 

21 Norman E. Gronlund, “The Relative Stability of Classroom Social Status with 
Unweighted and Weighted Sociometric Status Scores,” Journal of Educational Psy- 
chology, vol. 46 (October 1955), pp. 345-354. 

22 M. E. Bonney, “The Constancy of Sociometric Scores and Their Relationship
to Teacher Judgments of Social Success and to Personality Self Ratings," Sociometry, 
vol. 6 (November 1943), pp. 409-424. 




grades or above. Studies made in the kindergarten and primary grades 
show somewhat lower reliability. 
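A stability coefficient of this kind is a correlation between the number of choices each pupil receives on two occasions. The following sketch assumes the coefficient is an ordinary product-moment correlation; the two lists of scores are hypothetical and are not data from the studies cited.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical choices received by the same eight pupils on two testings.
first_testing = [0, 2, 3, 5, 5, 6, 8, 11]
second_testing = [1, 1, 4, 4, 6, 5, 9, 10]

stability = pearson_r(first_testing, second_testing)
```

The same function could be applied to choices received on two different criteria, or in two different groups, to examine the consistency discussed in the paragraphs that follow.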

If "total choices received" is presumed to represent the construct of 
“general social acceptance," we must be concerned with the amount of 
consistency in the student's social acceptance from one social group to 
another, and from one criterion to another. Gronlund and Whitney23
obtained an average r of .72 between number of choices each student
received as seating companion (within the classroom) and the number
he received as future classmate (throughout the school). 

When the consistency of sociometric status from one criterion to another
has been studied, the correlations have naturally varied in some degree with
the similarity of the criteria. In a study of 1258 sixth-graders by
Gronlund,24 five choices were requested for each of three criteria: seating
companions, work companions, and play companions. The intercorrelations
among criteria ranged from .76 to .89. The highest correlations were
between number of choices received as seating companion and as work
companion; the lowest correlations were between work and play criteria.


Northway suggests the following hypothesis: 


An individual's acceptance score as measured in one group is a reliable index
to what his acceptance score will be in a reasonably similar (culture-age) group.
That is, his acceptance score is an outward measure of a psychological
characteristic called acceptability.25


This hypothesis should not, however, be interpreted as an assurance 
that data from any sociometric testing program may be taken as valid 
evidence of a pupil’s general social acceptability, unless the following con- 
ditions are present: (1) the pupil has been well motivated in taking the 
sociometric test; (2) there have been opportunities within group situations
for building up social interrelationships; and (3) the criteria or questions
used were designed to reflect general social acceptance, such as choice of
seatmates or friends, rather than choice of pupils for a relationship
involving special qualifications, such as pupils “to help you in arithmetic” or
choice of pupils for such a position as class secretary. 


23 Norman E. Gronlund and A. P. Whitney, “Relation between Pupils’ Social
Acceptability in the Classroom, in the School, and in the Neighborhood,” School
Review, vol. 64 (September 1956), pp. 267-271.

24 Norman E. Gronlund, “Generality of Sociometric Status over Criteria in the
Measurement of Social Acceptability,” Elementary School Journal, vol. 56
(December 1955), pp. 173-176.

25 Mary L. Northway, Appraisal of the Social Development of Children at a
Summer Camp, Psychology Series, vol. 5, No. 1 (Toronto: University of Toronto
Press, 1940).


Interpreting Data on Group Structure 


An important function of sociometric techniques is to reveal the extent 
of integration or cohesiveness of a group. The following questions will assist 
the teacher in appraising the social structure of his class: 


1. What is the leadership structure within the group? If there are two or more 
well-organized groups, what are the attitudes of their leaders toward each 
other? Is the division into groups based on cleavages that appear in adult 
society (ethnic group, nationality, and social class)? 

2. How integrated or cohesive is the class social structure? If questions are 
used that permit outside-class choices, what percentage of choices were made 
within the class? To what extent have mutual choices been made? Are there 
any groups that are isolated from, or rejected by, the rest of the class? Are 
there any evidences of maladjustment or delinquency in these isolated or 
rejected groups? 

3. What do the majority of most-chosen students have in common (high social 
status, religious denomination, length of residence in the community, par- 
ticipation in after-school activities, and the like)? What do the neglectees 
have in common (nationality, lower socioeconomic status, newness to com- 
munity, or residence in trailer camp, housing project, or orphanage)? 
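Two of the indices mentioned in question 2, the percentage of choices made within the class and the extent of mutual choice, can be computed directly from the choice lists. This is a minimal sketch under stated assumptions: the names, the choices, and the function names `within_class_share` and `mutual_pairs` are all hypothetical.

```python
# Hypothetical choices; "Zed" and "Quin" are pupils outside the class.
choices = {
    "Ann": ["Bea", "Cy"],
    "Bea": ["Ann", "Zed"],
    "Cy":  ["Bea", "Quin"],
    "Dee": ["Ann", "Bea"],
}
class_members = set(choices)

def within_class_share(choices, class_members):
    """Fraction of all choices that were directed to members of the class."""
    total = sum(len(ranked) for ranked in choices.values())
    inside = sum(1 for ranked in choices.values()
                 for chosen in ranked if chosen in class_members)
    return inside / total if total else 0.0

def mutual_pairs(choices):
    """Set of two-pupil pairs in which each pupil chose the other."""
    pairs = set()
    for chooser, ranked in choices.items():
        for chosen in ranked:
            if chosen != chooser and chooser in choices.get(chosen, []):
                pairs.add(frozenset((chooser, chosen)))
    return pairs

share = within_class_share(choices, class_members)
pairs = mutual_pairs(choices)
```

Figures of this sort describe the structure only in outline; the questions about leadership, cleavages, and common characteristics still call for the teacher's own knowledge of the class.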


In the analysis of a sociogram, the age of the students must always be 
considered. In the kindergarten and primary grades, one is likely to find a 
relatively large number of isolates and a low number of mutual pairs. 
Choices are highly unstable, especially in the kindergarten and first grade. 

In working for reorganization of social groups, the teacher may find it 
especially useful to request data from students on both their interests and 
their choices of friends. Such a procedure allows the teacher considerable 
latitude in the formation of committees, for he may safely disregard the 
first and second choices of associates made by a student leader in the 
interests of honoring his first choice on activities. In this way, isolates can 
be placed with students with whom they would like to associate; clique 
members can be distributed among several committees on the basis of 
choices they direct toward nonclique members; clique leaders can be placed 
in a position of responsibility on committees involving several members of 
another clique; and a variety of patterns may be tried out. The teacher can 
probably be somewhat more venturesome in arranging short-lived commit- 
tees for a party than in planning long-term committees involving earnest 
cooperation and serious work on some classroom project. 

In satisfying the friendship choices of individuals, the following princi- 
ples have been found to be justified in practice: 


1. In order to carry out as many expressed wishes as possible, it is generally 
best to start with the children who have not been chosen at all or only seldom. 
It is usually better to give an unchosen pupil his own first choice. For example, 




if David chooses Patty first, Lee second, and Willard third, and no one chooses 
him—then David should be placed with Patty. 

2. Give any pupil in a pair relation the highest reciprocated choice from 
his point of view; his first choice if this is returned; his second if this is returned 
and his first is not, or his third if this is his only reciprocated choice on his list. 

3. If a child has received choices only from people other than the ones he 
chose, then give him his first choice. 

4. Make sure that each child has been placed with at least one of his
choices.26
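The four principles above amount to a rule for deciding which one of a pupil's choices should be honored first. The sketch below is one possible reading of them, not a procedure given in the source; the function name `choice_to_honor` is hypothetical, and the choice lists for Patty, Lee, and Willard are invented to complete the David example.

```python
# Choices in order of preference. David's list follows the example in the
# text; the other three lists are hypothetical.
choices = {
    "David":   ["Patty", "Lee", "Willard"],
    "Patty":   ["Lee", "Willard"],
    "Lee":     ["Willard", "Patty"],
    "Willard": ["Patty", "Lee"],
}

def choice_to_honor(pupil, choices):
    """Pick the one choice that the principles say should be satisfied."""
    own = choices[pupil]
    if not own:
        return None
    # Principle 2: the highest choice on his own list that is returned.
    for chosen in own:
        if pupil in choices.get(chosen, []):
            return chosen
    # Principles 1 and 3: unchosen, or chosen only by pupils he did not
    # choose, so he is given his own first choice.
    return own[0]

placements = {pupil: choice_to_honor(pupil, choices) for pupil in choices}
```

Because every pupil who made a choice is assigned one of his own choices, principle 4 is met for these guaranteed placements. Forming the actual groups from them still requires the teacher's judgment.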


Peer-nomination Techniques 


Peer judgments are also used in reputation or peer-nomination question- 
naires, in which the student is asked to name classmates whom certain 
“word pictures” in the test seem to describe. The following directions are
illustrative:


Here are some little word pictures of children you may know. Read each
statement carefully and see if you can guess whom it is about. It might be
about yourself. There may be more than one picture for the same person.
Think over your classmates and write after each statement the names of any
boys or girls who may fit it. If the picture does not seem to fit anyone in your
class, put down no name, but go on to the next statement. Work carefully and
use your judgment.27


Each word picture should be a brief description rather than a mere trait 
name. In research studies it has been found advantageous to use pairs of 
contrasting items, although they should not appear together in the
questionnaire. The following pair is quoted from Tryon's study of peer judgments
at different age levels: 

1. Here is someone who finds it hard to sit still in class; he moves around 
in his seat or gets up and walks around. 


2. Here is someone who can work quietly without moving around in his
seat.28


26 Jennings, Sociometry in Group Relations, op. cit., p. 73.

27 Stuart Stoke, “The Social Analysis of the Classroom,” unpublished
mimeographed report, Division on Child Development and Teacher Personnel,
Commission for Teacher Education, January 1940.

28 Caroline Tryon, Evaluations of Adolescent Personality by Adolescents,
Monograph of the Society for Research in Child Development, vol. 4, No. 4
(Washington, D.C.: National Research Council, 1939).




Space is left after each word picture for the student to list the names of 
classmates who fit the description. For classroom use, the teacher should 
avoid the use of word pictures that would be viewed negatively by the peer 
group. In the example above, the negatively worded statement would be 
quite satisfactory since variations in restlessness are socially acceptable. 

One cannot use the number of mentions a student receives for a specific 
trait as an index of how he ranks on this trait in comparison with other 
students. A student who is popular will be mentioned very frequently for
the desirable attributes; a student who is rejected will be mentioned very 
frequently for the undesirable attributes. For this reason, one should not 
study the results with respect to one trait at a time, picking out this boy as 
most restless, this one as most friendly, and the like. Rather, the entire 
pattern of mentions for an individual should be studied for significant clues 
in understanding and helping him. The findings for a given student repre- 
sent intraindividual differences, as perceived by his peers. 
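To study the entire pattern of mentions for an individual rather than one trait at a time, the nominations can be regrouped by student. The following is a minimal sketch; the word-picture labels, the names, and the function name `profiles` are hypothetical.

```python
from collections import defaultdict

# Hypothetical results: for each word picture, every name written down
# by any respondent (a name appears once per mention).
nominations = {
    "restless":      ["Tom", "Tom", "Ray"],
    "works quietly": ["Sue", "Ray", "Sue", "Sue"],
    "friendly":      ["Sue", "Tom", "Sue"],
}

def profiles(nominations):
    """Regroup mentions so each student has his own pattern across items."""
    result = defaultdict(dict)
    for item, names in nominations.items():
        for name in names:
            result[name][item] = result[name].get(item, 0) + 1
    return dict(result)

by_student = profiles(nominations)
```

Here `by_student["Tom"]` shows two mentions for restlessness and one for friendliness. The counts are read as a pattern within the individual, not as a ranking against classmates.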

Peer-nomination techniques have proved to be one of the most depend- 
able rating techniques. The number of "raters" is very large. Moreover, 
students are asked to list only those classmates who fit the descriptions; 
they are not asked to differentiate among those in the average range. An- 
other advantage of peer-nomination techniques is that a student's peers 
are in a position to observe his behavior in many informal situations in 
which he is under no pressure to show outward conformity to adult stand- 
ards. Hence peers obtain evidence on many interpersonal traits that is not 
available to teacher observers in more formal situations. Finally, peer 
nominations are significant in that they represent the environment of peer- 
group opinion in which the child lives. 

Gardner and Thompson have developed an adaptation of the peer- 
nomination scale that permits comparison of data for different groups, as 
well as comparisons of individuals in different school classes.29 When the
teacher summarizes data for his class he can ascertain for each pupil (1) 
the way he regards each of his classmates as a potential satisfier of two 
important social needs and (2) the way his classmates view him as a sat- 
isfier of these needs. 

Before the student begins filling in his reactions to his classmates, he 
establishes a frame of reference for rating by selecting individuals (inside 
or outside school) who represent for him most, average (or medium), and 
least need satisfaction, as well as two intermediate points between medium 
and each of the two extremes. Then each of his classmates is rated as 
better or less good than one of these five persons in helping him to satisfy 


29 Eric F. Gardner and George G. Thompson, The Syracuse Scale of Social
Relations (New York: Harcourt, Brace & World, Inc., 1958).




the specified need. In the example given in an explanatory article,30 John
chose his mother as the one whom he would most like to go to when he 
is troubled with some personal problem. He chose one of his girl classmates 
for the least rating and an uncle for medium. Another classmate and a 
neighbor were assigned to the intermediate positions. The needs, in terms 
of which students rate their classmates, are as follows: 


1. A possible source of aid when troubled by a personal problem (included at
all three levels)
2. Someone to help him to do something well so people will praise him
(elementary level, grades 5-6)
3. Someone to look up to as an ideal (junior high)
4. A person whose company he would enjoy at a party or recreation (senior
high)


The first question is used at all levels; the second question varies with the 
school level.

When the Syracuse scale has been used by research workers, the summary
scores have shown reliability coefficients approximating .90. Midscores
of ratings the student assigns to others, as well as the midscore of
ratings he receives, can be interpreted in terms of percentile norms, based
on a norming sample of more than a thousand students at each grade level.
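The two midscores can be illustrated with a small table of ratings. This sketch assumes that "midscore" means the median and that ratings are recorded as numbers, with higher values meaning greater need satisfaction; the numerical scale, the names, and the function names are hypothetical and are not taken from the Syracuse scale itself.

```python
from statistics import median

# ratings[rater][ratee]: hypothetical numerical ratings on one need.
ratings = {
    "John": {"Mary": 4, "Pete": 2, "Lou": 3},
    "Mary": {"John": 5, "Pete": 1, "Lou": 3},
    "Pete": {"John": 3, "Mary": 4, "Lou": 2},
    "Lou":  {"John": 4, "Mary": 5, "Pete": 2},
}

def midscore_given(pupil, ratings):
    """Median of the ratings this pupil assigns to his classmates."""
    return median(ratings[pupil].values())

def midscore_received(pupil, ratings):
    """Median of the ratings this pupil receives from his classmates."""
    return median(given[pupil] for rater, given in ratings.items()
                  if rater != pupil and pupil in given)
```

In this made-up class John gives a midscore of 3 and receives a midscore of 4. Converting such midscores to percentiles would require the published norms, which are not reproduced here.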


Cautions in Using Sociometric Techniques 


The primary purpose of using sociometric techniques is to increase the
teacher's understanding of the social relationships existing within the class
so that he can help students to improve in social acceptance and in
desirable social traits. One of the criticisms that is sometimes made of
sociometric techniques is that they encourage the students to think critically of
one another and so tend to crystallize antagonisms of a minor and
temporary nature, widen cleavages between cliques, and intensify or make
conspicuous the rejection by classmates of students who are socially
isolated. Obviously, no teacher wants results of this type.

A number of suggestions have already been made that are intended to 
minimize the possibility of damaging after-effects. These include (1) using 
questions in which the choices made by students are actually put into 
effect; (2) emphasizing positive preferences rather than rejections or dis- 
likes; (3) keeping the whole procedure as casual as possible; (4) manag- 
ing efficiently so that questions and discussion of procedure can be kept 


30 Eric F. Gardner and George G. Thompson, “Measuring and Interpreting Social
Relations,” Test Service Notebook, No. 22 (New York: Harcourt, Brace & World,
Inc., 1959).




to a minimum; and (5) giving careful consideration to the sociometric 
data when forming committees and other working groups. In addition, the 
teacher must be careful to avoid, in both his speech and his behavior, 
conveying to the students the impression that popularity is the most im- 
portant sign of personal worth. Finally, the teacher should regard sociometry 
as an exploratory technique in the study of students; when a student is 
discovered to be a neglectee, much more information needs to be obtained 
before the teacher has a sound basis for diagnosis. 


SUMMARY STATEMENT 


Three overlapping concepts of mental health or personality development 
were examined in this chapter—those of maturity, normality, and adjustment. 
The concept of "maturity" is a valuable one for teachers inasmuch as they 
can readily categorize a student's behavior as characteristic of his own age 
group or of a younger group. The term "normality" sometimes implies average 
or typical; at other times it is used in the sense of health vs. illness. The term 
"adjustment" is most frequently used to describe a person's relationships to 
his environment and therefore is a reasonable criterion of mental health only 
if the demands of the environment are reasonable. It is important also that a 
person's adjustment to environmental pressures be made without developing 
persistent internal conflicts. 

Two major approaches to personality description have been used: (1) the 
psychometric approach, in which one attempts to obtain quantitative estimates 
of different personality dimensions, and (2) the clinical approach, in which 
one uses a variety of techniques to obtain clues concerning the individual's 
needs and problems. Ideally, we should use the techniques best adapted to a 
specific situation, keeping in mind that data from any one source should be 
used only for hypothesis formulation and that conclusions are justified only 
when data from several independent sources converge to support an hypothesis. 

Data concerning the personal-social adjustment of individuals can be ob- 
tained from: (1) self-report (through questionnaires, interviews, or autobi- 
ographies), (2) observation of relevant behavior, (3) projective techniques, or 
(4) obtaining the opinions of others (through rating scales, sociometric tech- 
niques, and other means). A variety of these techniques were considered in 
this chapter except that techniques requiring specialized training and supervised 
experience were deferred to Chapter 9, namely: projective techniques and one 
type of self-report, that is, personality inventories. 

In interviewing students, valuable results can be obtained when good rapport 
is established, effective communication is maintained, and the teacher has 
familiarized himself with the background material. In their autobiographies, 
students may reveal a great deal about themselves, their interests, fears, wishes, 
self-concepts, and the like. 

Direct observation of student behavior can provide highly significant data, 
especially during self-directed activities, group activities, such as role playing, 
and extracurricular activities. The skilled observer distinguishes between ob- 
servation of behavior and its interpretation. Behavior journals and anecdotal 




records help to systematize observations and to provide a useful summary of 
observations over a period of time. 

There are several techniques of appraising the student's personal-social
adjustment as evaluated by others. Teachers’ observations can be summarized
concisely on checklists or rating scales. The desirable characteristics of rating
scales were presented and illustrated. Suggestions for improving the rating
process were also given. Teachers can construct sociometric charts on the
basis of students’ answers to one or more questions involving a choice of their
classmates (as seatmates, fellow-committee-members, or best friends). As
finally charted in a class sociogram, sociometric data aid the teacher in apprais- 
ing the social structure of his class and in identifying students who need help 
in achieving satisfying peer relationships. Peer-nomination techniques can be 
used to advantage in certain situations. However, the instruments must be care- 
fully devised, administered, and interpreted if the results are to be of maximum 
value. 


SELECTED REFERENCES 


ADKINS, DOROTHY C., “Principles underlying Observational Techniques of
Evaluation,” Educational and Psychological Measurement, vol. 11 (Spring
1951), pp. 29-51.

BASS, BERNARD M., “The Leaderless Group Discussion,” Psychological Bulletin,
vol. 51 (September 1954), pp. 465-492.

BRONFENBRENNER, URIE, AND HENRY RICCIUTI, “The Appraisal of Personality
Characteristics in Children,” in Paul H. Mussen, ed., Handbook of
Research Methods in Child Development. New York: John Wiley and Sons,
Inc., 1960, pp. 770-817.

DRISCOLL, GERTRUDE P., How to Study the Behavior of Children. New York:
Bureau of Publications, Teachers College, Columbia University, 1945.

FLANAGAN, JOHN C., “The Critical Incident Technique,” Psychological Bulletin,
vol. 51 (July 1954), pp. 327-357.

FLANAGAN, JOHN C., “A New Tool for Measuring Children's Behavior,”
Elementary School Journal, vol. 59 (December 1958), pp. 163-166.

GRONLUND, NORMAN E., Sociometry in the Classroom. New York: Harper &
Row, Publishers, Inc., 1959.

HEYNS, R. W., AND R. LIPPITT, “Systematic Observational Techniques,” in
Gardner Lindzey, ed., Handbook of Social Psychology. Cambridge, Mass.:
Addison-Wesley Publishing Company, Inc., 1954, Chapter 10.

HORN, ALICE, AND ALFRED S. LEWERENZ, “Measuring the ‘Intangibles’ in
Education,” California Journal of Educational Research, vol. 1 (September,
November 1950), pp. 147-153, 195-206.

HUTSON, P. W., a Studies in Character-Trait Rating,” Personnel and
Guidance Journal, vol. 38 (January 1960), pp. 364-368.

JARVIE, L. L., AND M. ELLINGSON, A Handbook of the Anecdotal Behavior
Journal. Chicago: University of Chicago Press, 1940.

LANGDON, GRACE, AND IRVIN W. STOUT, Teacher-Parent Interviews. Englewood
Cliffs, N.J.: Prentice-Hall, Inc., 1954.

MAYO, GEORGE D., “Peer Ratings and Halo,” Educational and Psychological
Measurement, vol. 16 (Autumn 1956), pp. 317-323.




MORENO, J. L., Who Shall Survive? 2d ed. New York: Beacon House, Inc.,
1953.

NORTHWAY, MARY L., A Primer of Sociometry. Toronto: University of Toronto
Press, 1952.


PEAK, HELEN, "Problems of Objective Observation," in Leon Festinger and 
Daniel Katz, eds., Research Methods in the Behavioral Sciences. New 
York: Holt, Rinehart and Winston, Inc., 1953, pp. 243-299. 

PROCTOR, C. H., AND C. P. LOOMIS, “Analysis of Sociometric Data,” in M.
Jahoda, M. Deutsch and S. Cook, eds., Research Methods in Social
Relations, Part II. New York: Holt, Rinehart and Winston, Inc., 1951,
Chapter 17.

TABA, HILDA, AND OTHERS, Diagnosing Human Relations Needs. Washington, 
D.C.: American Council on Education, 1950. 

TAYLOR, DOROTHEA, “How to Obtain Autobiographies,” Personnel and Guidance 
Journal, vol. 36 (February 1958), pp. 426-427. 

THORPE, LOUIS P., AND OTHERS, Studying Social Relationships in the Classroom. 
Chicago: Science Research Associates, 1959. 

WOLFE, DON M., “Fruitful Long Paper: The Autobiography," The English 
Journal, vol. 45 (January 1956), pp. 7-12, 38. 


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. What important purposes may be served by pupil interviews at the grade 
level at which you teach? 

2. Evaluate the autobiography as a method of child study. 

3. Observe children on the playground, and make a record of behaviors in- 
dicative of feelings of self-assurance. 

4. Why is observation an important and useful method of studying children? 
In what ways can observational methods be improved? 

5. List several important precautions to be adopted by teachers in writing 
and interpreting anecdotal records. 

6. What are the limitations of subjective techniques of evaluation? How may 
they be overcome? 

7. Summarize a published case history and evaluate it. 

8. Obtain from the local high school rating scales used in physical educa- 
tion, industrial arts, homemaking, or other classes. Criticize these in light of 
the principles developed in this chapter. 

9. Which of the following traits would probably be most difficult to rate 
reliably on the basis of repeated observations? (a) promptness, (b) neatness, 
(c) leadership ability, (d) integrity, (e) interest in a subject. 

10. Imagine that you are to present to your faculty coworkers the sociogram 
for the boys of your class (shown in Fig. 8.3). Outline briefly your explanation 
of the techniques, together with any implications suggested by the data. 

11. Evaluate sociometry as a method of studying children and adolescents. 
List its advantages and limitations. 


Personality Inventories and Projective Tests


Some psychologists tend to consider personality as an indefinable whole,
very complex in nature and not susceptible to analysis. Other psychologists
consider this point of view mystical, vague, and of little value in practice.
They would define personality from a psychometric point of view as a
pattern of traits or ways of reacting to environmental stimuli. Fortunately, an
increasing number of psychologists are combining these two views. They
recognize that measurement of personality can proceed only through
attempting the identification of personality components that are definable,
relatively independent, and reasonably homogeneous or unitary in nature.
They concede, however, the limitations of present attempts to identify and
measure personality components. They concede also the critical importance
of the individual's integrating his various traits or reaction patterns into a
smoothly working, effective whole.

We are presenting in this chapter two quite different approaches to the
study of personality, that is, structured personality inventories and
semistructured projective tests. They have been grouped together because their
use by unqualified persons is hazardous, and their value depends a great
deal on the examiner's background in psychology and his ability to
synthesize information from many sources as a basis for hypotheses about the
individual under study.

In a sense, these two approaches represent the two extremes of the
psychometric vs. the clinical approach discussed in the preceding chapter.
Persons who have considerable faith in one of these approaches are likely
to distrust the other.

The personality inventories discussed in the first chapter section are
called structured inventories. A test is said to be structured when it is so
designed that all examinees interpret the items in the same way. Although
there is some ambiguity, and consequently some variation in the subject's
interpretation of inventory items, the authors of structured inventories try
to minimize such ambiguity.

Projective tests involve a minimum of structuring. In projective tests, 
ambiguous content, which permits a variety of interpretations, is valued. 
In interpreting ambiguous stimuli, such as ink blots or vague indefinite 
pictures, the subject is encouraged to react in a highly individualized man- 
ner. He may be told, for example, that "any story will do." The clinical 
psychologist assumes that the subject tends to project into these ambiguous 
stimuli his own wishes, fears, and repressed conflicts. 

The style with which the subject works with paint or clay, or the extent 
to which he uses form, color, and shading in the interpretation of an ink 
blot, is thought to reveal clues concerning his characteristic approach to 
real-life situations. Much more research is needed to validate the hypotheses 
now utilized in the interpretation of projective tests. Fortunately,
well-qualified psychologists check hypotheses developed from projective tests
with data from other sources such as interviews and observations. 

The person who prefers projective tests likes to observe the reactions of 
his subjects in situations in which they have opportunities to be individual- 
istic and creative, where their defenses are down and they are caught off 
guard. Many clinically oriented psychologists contend that only such ap- 
proaches produce valid data. The psychometrically oriented psychologists, 
on the other hand, tend to distrust the subjectivity of interpretation of pro- 
jective tests and to question the hypotheses used as guide lines in such 
interpretation. 

Most of the personality inventories have been developed, standardized, 
and studied by psychologists with a psychometric orientation. All students
taking a personality inventory are presented with a uniform set of ques- 
tions; the person's responses are objectively scored according to predeter- 
mined keys, and the results are interpreted in comparison with norming 
samples. As we will see in this chapter, however, the interpretation of 
results from such inventories cannot be routine and objective. 

While personality inventories are self-report questionnaires, the projec- 
tive tests involve observation of the individual's performance as he tells a 
story, draws a picture of his family, or utilizes various dolls and back- 
ground settings in dramatic play. Qualitative interpretations of the person's 
style or method of attack may be more important than any quantitative 
evaluation of the product. Projective tests present unstructured or semi- 
structured situations that permit a variety of perceptions and interpretations. 
In fact, the examiner encourages the individual to allow free rein to his 
responses and reassures him that his own individualized interpretations are 
desired. 




PERSONALITY INVENTORIES 


In the early days of personality measurement, psychologists attempted to 
develop inventories that measured single aspects of personality. Allport 
was concerned with ascendance-submission; Marston and others with
extroversion-introversion. Soon, however, a number of psychologists attempted
to meet the need of psychologists, research workers, and others for a test 
that would describe many aspects of personality, which would provide a 
personality profile comparable to the achievement profile obtained by the 
use of an achievement-test battery. A number of self-styled personality 
inventories were developed, each based on the author's own list of person- 
ality traits. Eventually, reactions developed against such test-construction 
practices. As Spencer said, *Personality traits cannot be created by the 
Psychologist.” 


Studies of Components or Factors in Personality 


One of the basic research problems in the measurement of personality
is to identify relatively independent personality traits. As one might expect,
factor analysis has not met with as good results in personality as in ability
measurement. Research to date has shown less agreement.

Cattell, Guilford, and others have applied the techniques of factor
analysis in an attempt to find personality traits that are unitary or
homogeneous; that is, groupings of specific elements of behavior that tend
to go together in individuals and are functionally interrelated. Behavior
can be described in very small units, such as "jumping at the barking of a
dog," or "biting one's finger nails," as is done in observational records.
However, it would make for economy of description and would certainly
facilitate personality measurement if it were established, for example, that
"nervousness" were a unitary trait of personality.

Cattell found that there were approximately 5500 personality trait
names in the English language. A selected list of 171 personality trait
names was first developed by examining the original 5500, eliminating the
trivial and grouping together those that were obvious synonyms. All later
simplification of the list, however, was done by the statistical procedure
known as "cluster analysis." As a result of his studies, Cattell identified
35 surface traits; that is, traits that were evidenced in surface behavior and
that seemed to include (on the basis of statistical analysis) the
characteristics present in the original list of 171 qualities. Further
statistical analysis revealed source traits that were more nearly basic and
accounted for many of the surface traits. Of the source traits that he has
identified, Cattell considered six to be best established.²

1 Douglas Spencer, The Fulcra of Conflict, A New Approach to Personality
Measurement (New York: Harcourt, Brace & World, Inc., 1939), p. 20.


SOURCE
TRAIT   CHIEF CHARACTERISTICS

A   Good-natured, easy-going, ready to cooperate, attentive to people,
    soft-hearted, kindly, trustful, adaptable, warm-hearted
      vs. Spiteful, grasping, critical, obstructive, cool, aloof, hard,
          suspicious, rigid, cold

B   Intelligent
      vs. Mentally defective

C   Emotionally mature, emotionally stable, calm, phlegmatic, realistic
    about life, absence of neurotic fatigue, placid
      vs. Lacking in frustration-tolerance, changeable, showing general
          emotionality, evasive, neurotically fatigued, worrying

E   Assertive, self-assured, independent-minded, hard, stern, solemn,
    unconventional, tough, attention-getting
      vs. Submissive, dependent, kindly, soft-hearted, expressive,
          conventional, easily upset, self-sufficient

F   Talkative, cheerful, placid, frank, expressive, quick, alert
      vs. Silent, introspective, depressed, anxious, uncommunicative,
          smug, languid, slow

I   Demanding, impatient, dependent, immature, imaginative, introspective,
    kindly, gentle, aesthetically fastidious, frivolous, attention-getting
      vs. Emotionally mature, independent-minded, set and smug, hard,
          cynical, lacking artistic feeling, responsible, self-sufficient


Guilford used a different approach to the problem, administering sev- 
eral personality tests to large numbers of subjects, computing hundreds of 
intercorrelations between pairs of scores for the same individuals, and 
completing a factor analysis of the correlation matrix. As a result of this 
research, Guilford tentatively identified ten relatively independent personal- 
ity traits: 


G  General activity: hurrying, liking for speed, liveliness, vitality,
   production, efficiency

R  Restraint: serious, deliberate, persistent, vs. carefree, impulsive,
   excitement-loving

A  Ascendance: self-defense, leadership, bluffing, speaking in public, vs.
   submissiveness and hesitation

S  Sociability: many friends, seeking friends and social activities, seeking
   limelight vs. few friends, shyness

E  Emotional stability: evenness of moods, optimistic, composure, vs.
   fluctuation of moods, pessimism, daydreaming, excitability, feelings of
   guilt, worry, loneliness, and ill health

O  Objectivity: thick-skinned, accurate, observing, vs. hypersensitive,
   self-centered, suspicious, having ideas of reference

F  Friendliness: tact, acceptance of domination, respect for others, vs.
   hostility, resentment, desire to dominate, and contempt for others

T  Thoughtfulness: reflective, observing of self and others, mental poise, vs.
   interest in overt activity and mental disconcertedness

P  Personal relations: tolerance of people, faith in social institutions, vs.
   fault-finding, uncooperative, suspicious, self-pitying

M  Masculinity: interested in masculine activities, not easily disgusted,
   hard-boiled, inhibits emotional expression, little interest in clothes and
   styles, vs. easily disgusted, fearful, romantic, emotionally expressive,
   and dislike of vermin³

2 Raymond B. Cattell, Description and Measurement of Personality (New York:
Harcourt, Brace & World, Inc., 1946).


Some of the personality traits that had been hypothesized by the pioneers
in personality testing, such as dominance-submission and emotional stabil- 
ity, were confirmed by these studies as independent personality traits. 
Others, such as extroversion-introversion, did not emerge as unitary traits. 
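
The first computational step in Guilford's approach, the table of intercorrelations between pairs of scores for the same individuals, can be illustrated with a minimal sketch. The scale names and the five sets of scores below are invented for illustration and are not Guilford's data; the factor analysis of the resulting matrix is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(scores):
    """scores maps a scale name to its list of scores, one score per subject."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}

# Hypothetical scores of five subjects on three made-up scales.
scores = {
    "sociability": [12, 15, 9, 20, 14],
    "ascendance": [10, 14, 8, 18, 13],
    "restraint": [16, 11, 19, 7, 12],
}
matrix = correlation_matrix(scores)
```

With several tests and large numbers of subjects, the same routine yields the hundreds of coefficients that form the correlation matrix submitted to factor analysis.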


The Limitations and Advantages of the Transparent Personality Inventory

If an individual has sought guidance, use of a personality questionnaire
with direct, transparent questions may be advisable. The early inventories
were quite transparent, while recently published inventories have
incorporated a number of devices to minimize the probability that the subject
is answering the inventory so as to create a good impression. Actually,
whether one uses a more transparent or more subtle approach depends
largely on the situation, the attitude of the examinee, and the use to be made
of the results. The fact that the more transparent personality inventories
can be faked does not imply that they always are. According to one
research study, if students are convinced that their replies will be kept
confidential and used as a basis for helping them, they will usually give
frank responses to personality questionnaires.⁴

If a person voluntarily indicates symptoms of poor mental health through
his replies to direct questions, his replies are easy to interpret and
counseling can proceed more effectively. An individual who has sought help
is usually motivated to answer truthfully; he has faced up to the fact that
a frank discussion of his problems may be a necessary prelude to
improvement.

3 Edward B. Greene, Measurements of Human Behavior, rev. ed. (New York: The
Odyssey Press, Inc., 1952), pp. 636-638.

4 Dora E. Damrin, "A Study of the Truthfulness with Which High School Girls
Answer Personality Tests of the Questionnaire Type," Journal of Educational
Psychology, vol. 38 (April 1947), pp. 223-231.

If a personality inventory is to be used as an aid in selection and
classification, the results from a transparent inventory may be useless, or even
misleading. A more indirect approach must be used. By using subtle ques- 
tions, or by employing forced-choice questions, we may “trap” the indi- 
vidual into revealing more about himself than he intended to do. However, 
the more indirect the questions, the less confident we can be of our inter- 
pretation of his scores. Using indirect questions, or the forced-choice 
technique, makes an inventory more difficult to fake; hence inventories 
that are less transparent are preferred in prediction situations. However, 
as we gain in fake-resistance and predictive validity, we may lose in mean- 
ingfulness or construct validity. 

Rimland found that warning examinees that the inventories would be
scored for truthfulness brought good results. That is, he compared the
extent of faking under standard instructions with that under instructions
warning students that a "lie" score would be obtained. He found that
informing examinees about validation scores reduced faking to a minimum.⁵

The problem of "faking," however, is not the only one we face in using
personality inventories. Just as important as frankness is the student's
insight into his own behavior, that is, his ability to describe his own
reactions without distortion.


Adjustment is an emotional matter, something at which people cannot look
in the light of pure reason alone. In contemplating their own adjustment, they
are more likely to become biased, prejudiced, secretive, and deceitful of others
and of self, than when contemplating their achievement in geometry, their
physical health, or even their mental ability.⁶


It has been found that the maladjusted individuals are the ones who are
most likely to distort their responses on personality inventories.⁷ Hence, it
is most difficult to obtain valid responses from those students for whom
diagnosis is most important. Although there is little doubt that a student
who makes a low score on a personality questionnaire should receive
further study, there is no guarantee that high scores indicate good adjustment.
The maladjusted student may be untruthful about his feelings and actions
or may have built up defense mechanisms that obscure his insight into his
own problems.

5 Bernard Rimland, The Development of a Test for Selecting Career Motivated
NROTC Applicants. Bureau of Naval Personnel Technical Bulletin 57-58, 1957.

6 H. H. Remmers and N. L. Gage, Educational Measurement and Evaluation
(New York: Harper & Row, Publishers, Inc., 1943), p. 338.

7 P. E. Vernon, "Review of Humm-Wadsworth Temperament Scale," The 1940
Mental Measurements Yearbook (Highland Park, N.J.: The Gryphon Press, 1941),
pp. 122-124.

Of course, the same factors that make it difficult to obtain frank and 
insightful responses on a personality questionnaire apply also to such tech- 
niques as interviewing student or parent and observing student behavior. 
As they grow older, people learn in varying degrees to conceal feelings 
and attitudes that are not socially approved. 


Attempts to Increase the Validity of Inventory Results 


The transparent inventory provides only a summary record of those
symptoms and self-criticisms the individual is willing to check. When an
individual seeks guidance, a transparent inventory can facilitate the
counseling process by indicating the problems he recognizes and may be willing
to discuss. For use in vocational guidance, diagnosis of maladjustment,
and personality research, however, the transparent inventory has proved
quite unsatisfactory.

Ellis, after reviewing studies reported in the literature, concluded that
"group-administered paper-and-pencil personality questionnaires are of
dubious value in distinguishing between groups of adjusted and maladjusted
individuals and that they are of much less value in the diagnosis of
individual adjustment or personality traits."⁸


Some of the more recently published inventories have incorporated
features, such as the use of forced-choice questions⁹ (which have made them
more fake-resistant) and verification scores (which have helped test-users
to identify inventories in which students have distorted their responses to
create a good impression). Even with these improvements, however, routine
use of personality inventories seems inadvisable. Adequate psychological
background and experience are required for drawing sound inferences
from personality inventory results. Hence, such inventories are best used
as one approach in the study of selected students by qualified psychologists
and counselors.¹⁰ The cumulated research on the inadequate concurrent
and construct validity of personality inventories has led to many attempts
to improve their validity.

8 Albert Ellis, "The Validity of Personality Questionnaires," Psychological
Bulletin, vol. 43 (September 1946), p. 426.

9 Forced-choice questions are of the type used in the Kuder Preference Record in
which the examinee is forced to choose between responses. In forced-choice
inventories, the responses among which a student must choose are first matched
with respect to their social desirability. In such an inventory, the examinee is
prevented from choosing most of the socially desirable items and rejecting the
socially undesirable ones. For a more adequate explanation of the response
tendency to answer questions in a socially desirable manner, see Allen L.
Edwards, The Social-Desirability Variable in Personality Assessment and
Research (New York: Holt, Rinehart and Winston, Inc., 1957).


DISGUISING INVENTORY ITEMS Some test authors have attempted to 
disguise their test items, anticipating, in a sense, the student's tendency to 
rationalize his behavior. The following examples of partially disguised, or 
subtle, questions are taken from the California Test of Personality: 


Are your tests so hard and unfair that it is right to cheat? 
Do your classmates quarrel with you a great deal? 
Do you suffer more than most people when you are ill? 


Wiener¹¹ contends that partially disguised or subtle items are best for
personality inventories used with normal subjects who frequently show a
tendency to respond in a socially desirable way, while obvious items
function best for abnormal subjects, especially if the latter frankly
acknowledge their need to report symptoms and gain help thereby. Cronbach¹²
has suggested that interest and personality inventories be scored separately
for obvious and subtle items.


OBSCURING THE SCORING PATTERN Authors have also tried to reduce 
the examinee's awareness of the traits or components being measured by 
arranging the items pertaining to each component (such as antisocial be- 
havior or nervous mannerisms) in random order throughout the test. When 
similar items are grouped together, the inventory becomes more trans- 
parent and the examinee can more readily respond to items so as to create 
any desired impression. Scattering items for each trait throughout the in- 
ventory complicates hand scoring but probably increases the validity of test 
scores, as compared with the procedure of grouping together all questions 
on antisocial behavior, all questions on nervous mannerisms, and the like. 
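
The practice of scattering the items for each component through the inventory, while still scoring each component separately, can be sketched as follows. This is an illustration only: the trait labels, the placeholder item names, and the convention that "yes" is the keyed response are assumptions made for the example and are not taken from any published inventory.

```python
import random
from collections import Counter

def scatter_items(items_by_trait, seed=0):
    """Return (trait, question) pairs in a shuffled order."""
    pool = [(trait, q) for trait, qs in items_by_trait.items() for q in qs]
    random.Random(seed).shuffle(pool)  # seeded so the printed form is reproducible
    return pool

def score_by_trait(ordered_items, responses, keyed="yes"):
    """Count keyed responses per trait; responses[i] answers ordered_items[i]."""
    totals = Counter({trait: 0 for trait, _ in ordered_items})
    for (trait, _), answer in zip(ordered_items, responses):
        if answer == keyed:
            totals[trait] += 1
    return dict(totals)

# Hypothetical item pool: placeholder names stand in for the question texts.
items_by_trait = {
    "nervous mannerisms": ["n1", "n2", "n3"],
    "antisocial behavior": ["a1", "a2", "a3"],
}
form = scatter_items(items_by_trait)
```

The scoring routine must carry the trait label for every item position, which is the bookkeeping that makes hand scoring of a scattered inventory more laborious.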


10 As will be explained later, a problems checklist can be administered to large 
numbers of students as an aid in discovering common problems for consideration in 
group guidance work or for identifying students who voluntarily admit problems 
they consciously face and on which they would like individual counseling. 

11 Daniel N. Wiener, "Subtle and Obvious Keys for the Minnesota Multiphasic
Personality Inventory," Journal of Consulting Psychology, vol. 12 (May-June 1948),
pp. 164-170.

12 Lee J. Cronbach, Essentials of Psychological Testing (New York: Harper & 
Row, Publishers, Inc., 1960), p. 458. 




USING VERIFICATION SCORES TO IDENTIFY INVALID INVENTORIES Considerable
work has also been done on the development of sets of items that
can be scattered through an inventory and scored so as to reveal test-taking
attitudes. Verification or validation scores, as they are called, are especially
needed if fairly transparent questions are being used in situations in which
many examinees would feel on the defensive. The more threatened the
examinee feels by the test situation and his own feelings of inadequacy,
the more likely he is to make "good impression" or "socially desirable"
responses. The more transparent the inventory, the easier it is for the
examinee to "fake good" or present a facade. The California Psychological
Inventory includes a number of items selected because examinees tend to
respond differently to them under standard and "fake good" directions.
These items constitute a "good impression scale."

Several inventories include a number of questions on common faults
and frailties to which the majority of persons make the unfavorable
response; for example, "Have you ever pretended to know something you
did not know?" A person who answers a large number of these questions
in the favorable direction has too high a "lie" score to have his inventory
considered valid.

The Minnesota Multiphasic Personality Inventory (MMPI), a very
extensively studied inventory used chiefly by clinical psychologists, has
several validation scores: a lie score, a K-score of test-taking defensiveness
(similar to a good-impression scale), an F-score on deviant or rare responses,
a question score (the total number of items to which the examinee responds
with "cannot say"), and an inconsistency score. When several validation
scores are available, they can be used in combination. For example, a
person with high scores in both deviancy and inconsistency has probably filled
out the questionnaire very carelessly; while a person with a high deviancy
score and high consistency may actually show bizarre responses in real-life
situations.
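
The logic of a lie score, and of using validation scores in combination, might be sketched as below. The item names, the cutoff values, and the wording of the verdicts are hypothetical; they are not the MMPI's actual items or cutoffs, which must be taken from the test manual.

```python
def lie_score(responses, lie_items):
    """Count lie-scale items answered in the favorable (improbably virtuous) direction.

    responses: item name -> the examinee's answer
    lie_items: item name -> the answer counted as 'favorable'
    """
    return sum(1 for item, favorable in lie_items.items()
               if responses.get(item) == favorable)

def interpret_validation(lie, deviancy, inconsistency, cutoffs):
    """Combine validation scores following the reasoning in the text."""
    if lie >= cutoffs["lie"]:
        return "invalid: probable good-impression responding"
    if deviancy >= cutoffs["deviancy"] and inconsistency >= cutoffs["inconsistency"]:
        return "invalid: probably careless responding"
    if deviancy >= cutoffs["deviancy"]:
        return "deviant but consistent: may reflect actual behavior"
    return "no validity concern flagged"

# Hypothetical lie items: most people admit these faults, so "no" is the
# favorable answer that counts toward the lie score.
lie_items = {"pretended_to_know": "no", "lost_temper": "no", "told_white_lie": "no"}
cutoffs = {"lie": 3, "deviancy": 10, "inconsistency": 5}  # illustrative values only

responses = {"pretended_to_know": "no", "lost_temper": "no", "told_white_lie": "no"}
verdict = interpret_validation(lie_score(responses, lie_items), 2, 1, cutoffs)
```

A verdict of this kind is only a signal for the psychologist; it does not replace judgment about the individual case.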


USING CORRECTION SCORES Most types of validation scores merely
enable the psychologist or counselor to discard inventories that are suspected
of having been filled out carelessly, or to create a good impression.
However, some validation scores, for example, the K-score on the MMPI, are
used to correct for the tendency to "fake good" or respond defensively.
If an examinee has a high K-score (which presumably reflects
defensiveness in test-taking), his scores are corrected for this tendency. His
corrected inventory is salvaged rather than discarded.
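
One common form of such a correction is to add a weighted fraction of the correction score to the raw score on each affected scale. The sketch below assumes that form; the scale names and weights are placeholders chosen for illustration and should not be read as the published MMPI weights.

```python
def k_corrected(raw_scores, k_score, k_weights):
    """Add a weighted fraction of the K-score to each raw scale score.

    Scales without an entry in k_weights are left uncorrected.
    """
    return {scale: raw + k_weights.get(scale, 0.0) * k_score
            for scale, raw in raw_scores.items()}

# Hypothetical raw scores and correction weights.
raw_scores = {"scale_x": 10, "scale_y": 12}
k_weights = {"scale_x": 0.5}  # illustrative weight only
corrected = k_corrected(raw_scores, k_score=8, k_weights=k_weights)
```

The more defensive the examinee appears (the higher K), the more the corrected score is raised above the raw score on the weighted scales.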


SELECTING AND KEYING ITEMS ON THE BASIS OF EMPIRICAL DATA Another
approach to increasing inventory validity is to select and key items
on the basis of research findings regarding the typical responses of criterion
groups. The student will recall that the responses to items of Strong's
interest inventories were assigned weights on different occupational keys
in terms of the differential responses of occupational groups, rather than
on any logical, a priori basis. The MMPI is an example of a personality
inventory in which the items are empirically selected and keyed. Items that
paranoid patients answered differently from normals were assigned to the
paranoid key. Since paranoids tend to insist on their own ideas, the
following question would logically be assigned to that key: "It takes a lot of
argument to convince some people of the truth." However, since paranoids
did not check this statement as true any more frequently than did normals,
it is not included in that scale. Answering false to this question, however,
was found to be associated with hysteria and is therefore keyed on the
hysteria scale.
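
Empirical keying of this kind can be sketched as a comparison of endorsement rates in a criterion group and a normal group. The sketch is a simplification: the item names, the data, and the fixed minimum difference of .25 are assumptions made for the example, and an actual study would ordinarily apply a test of statistical significance to much larger groups.

```python
def endorsement_rate(group, item):
    """Proportion of a group answering 'true' to an item."""
    return sum(1 for person in group if person[item]) / len(group)

def build_empirical_key(criterion_group, normal_group, items, min_difference=0.25):
    """Key only those items on which the two groups differ.

    Returns item -> keyed answer (True or False). Items answered alike by
    both groups are left off the key, whatever their apparent content.
    """
    key = {}
    for item in items:
        diff = (endorsement_rate(criterion_group, item)
                - endorsement_rate(normal_group, item))
        if abs(diff) >= min_difference:
            key[item] = diff > 0
    return key

# Hypothetical answers (True = "true") from four patients and four normals.
patients = [
    {"q1": True, "q2": True, "q3": False},
    {"q1": True, "q2": False, "q3": False},
    {"q1": True, "q2": True, "q3": False},
    {"q1": True, "q2": False, "q3": False},
]
normals = [
    {"q1": True, "q2": True, "q3": True},
    {"q1": False, "q2": False, "q3": True},
    {"q1": False, "q2": True, "q3": True},
    {"q1": False, "q2": False, "q3": False},
]
key = build_empirical_key(patients, normals, ["q1", "q2", "q3"])
```

Here q2 is dropped because both groups endorse it equally often, however plausible its content, and q3 is keyed in the "false" direction because the criterion group denies it more often than normals do.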

Responses to questions were treated as symptoms or signs of a diagnostic 
category in the "concurrent validity" sense; they were not considered as a
representative sampling of a defined area as in the “content validity" sense. 
As the reader can see, the empirical approach makes no assumptions re- 
garding the examinee's honesty or insight. Rather, it evaluates each item 
empirically in terms of its relationship to the criterion, for example, to 
membership in a diagnostic group or (for trait inventories for normal 
subjects) with ratings by close associates on the trait presumably being 
measured. The reader will recognize, however, that the empirical basis for 
selecting items is no royal road to the development of adequate personality 
inventories because it is so difficult to obtain adequate criterion data. 

On personality questionnaires designed to identify tendencies to abnor- 
mal behavior, the items are usually validated by comparing the responses 
of subjects in various diagnostic categories in mental institutions with those 
of so-called normal subjects. Those questions are included that tend to be 
answered differently by normal and abnormal subjects. The criterion for 
item validity is membership in a normal group or in a diagnosed group of 
patients from a mental hospital or clinic. Appropriate methods of develop- 
ing and validating inventories designed to have concurrent validity are pre- 
sented in Table 4.3. 

If a personality inventory is being used in the selection of employees,¹³
items will be selected and keyed in terms of their predictive validity for
some such criterion as job turnover or job production (for example,
amount of insurance sold). Such predictive validity studies are best done in
each company; job requirements, working conditions, and criteria of
success vary enough that items that work best in one situation may not be the
ones that have highest predictive validity in another. The item validities
obtained in one group of employees need to be checked or cross-validated
with another group; and after a period of years, restudy is again needed to
see if the predictive value of items has changed.

13 "Standards of Ethical Behavior for Psychologists," American Psychologist, vol.
13 (June 1958), pp. 266-272. When personality inventories are used as a basis
for selection and classification, it is essential that the psychologist make sure that
examinees understand the purpose for testing and the way in which the results will
be used. The "Standards of Ethical Behavior for Psychologists" must be
scrupulously observed.
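
The cross-validation of item validities described above can be sketched as follows: items are selected on their correlation with the criterion in one group and then checked in a second group. The item names, the data, and the selection threshold of .30 are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; with 0/1 item scores this is the point-biserial r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def item_validities(responses, criterion):
    """responses: item -> list of 0/1 answers; criterion: one value per employee."""
    return {item: pearson(answers, criterion) for item, answers in responses.items()}

def cross_validate(sample_a, sample_b, threshold=0.3):
    """Select items in sample A; report (validity in A, validity in B) for each."""
    validity_a = item_validities(*sample_a)
    validity_b = item_validities(*sample_b)
    return {item: (r, validity_b[item])
            for item, r in validity_a.items() if r >= threshold}

# Hypothetical data: two items, six employees per group, criterion = units sold.
group_a = ({"i1": [1, 1, 0, 0, 1, 0], "i2": [1, 0, 1, 0, 1, 0]}, [9, 8, 3, 2, 7, 4])
group_b = ({"i1": [1, 0, 1, 0, 1, 0], "i2": [0, 1, 1, 0, 0, 1]}, [8, 3, 9, 2, 7, 4])
result = cross_validate(group_a, group_b)
```

In these invented data both items pass the threshold in the first group, but only the first holds up in the second; the second item's apparent validity was a chance finding of the kind cross-validation is meant to expose.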

When attempts are made to measure personality traits of normal indi- 
viduals, such as dominance or sociability, questionnaire items are often 
selected by comparing the responses of students who are rated high by 
teachers or psychologists in the personality trait being measured with the 
responses of those who are rated low in the same trait. Correlation with 
ratings is a questionable method of validating items because the criterion 
ratings tend to be unreliable, to be affected by extraneous factors (such as 
the student's attitude toward school and the teacher), and to have no dem- 
onstrated construct validity of their own. 

Another basis for item selection is to compare the responses to individual 
questions of students who make high total scores in dominance, sociability,
and the like, with the responses of students who make low total scores on 
these same traits on the preliminary edition of the test. This technique of 
item selection will make the subtests more homogeneous but ignores the 
problem of whether the subject’s verbal responses are related to his be- 
havior or other people’s impressions of him. No external criterion is in- 
volved. Methods of developing and validating tests to be used in trait 
description involve many approaches, summarized in Table 4.9. 
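
The comparison of high-scoring and low-scoring students on a single item can be sketched as a simple difference in proportions. The group size and the data are assumptions chosen for the example. Note that, as the text points out, the only "criterion" here is the total score on the test itself.

```python
def discrimination_index(item_answers, total_scores, group_size):
    """Proportion giving the keyed answer in the high group minus the low group.

    item_answers: 0/1 per student (1 = keyed response to this item)
    total_scores: each student's total score on the preliminary scale
    group_size:   number of students taken from each extreme
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:group_size], order[-group_size:]
    p_high = sum(item_answers[i] for i in high) / group_size
    p_low = sum(item_answers[i] for i in low) / group_size
    return p_high - p_low

# Hypothetical totals for six students and their answers to two items.
totals = [1, 2, 3, 4, 5, 6]
item_kept = [0, 0, 1, 0, 1, 1]      # answered in the keyed direction by high scorers
item_dropped = [1, 0, 0, 1, 0, 1]   # unrelated to total score
d_kept = discrimination_index(item_kept, totals, group_size=2)
d_dropped = discrimination_index(item_dropped, totals, group_size=2)
```

Retaining only items with a large index makes the subtest more homogeneous, but nothing in the computation shows whether the answers correspond to the student's behavior.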


The Reliability of Personality Inventories 


Measuring test reliability by the test-retest method is inappropriate for
personality inventories. The inconsistency in a student's responses to
questions in a personality test may actually reflect an important aspect of his
personality. Cattell found, for example, that individuals who changed their
responses the most when tests of interests and attitudes were readministered
after a few days tended to rate high on emotional instability, one of the
most important source traits in personality. Because a test-retest reliability
coefficient would be affected by individual differences with respect to this
trait, internal consistency methods are preferred in checking the reliability
of personality inventories. The split-halves method and the Kuder-Richardson
method, which reveal only the internal consistency of the test, are most
frequently used.
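
Both internal-consistency methods can be computed from a single administration, as the sketch below suggests. It assumes items scored 0 or 1, an odd-even split with the Spearman-Brown step-up for the split-halves coefficient, and Kuder-Richardson formula 20; the small data set is invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_reliability(item_matrix):
    """Odd-even split, stepped up to full length by the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in item_matrix]
    even = [sum(row[1::2]) for row in item_matrix]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)

def kr20(item_matrix):
    """Kuder-Richardson formula 20 for items scored 0 or 1."""
    n_people, k = len(item_matrix), len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n_people  # proportion keyed
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical responses: six examinees (rows) by four items (columns).
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
r_split = split_half_reliability(answers)
r_kr20 = kr20(answers)
```

Neither coefficient says anything about stability over time, which is precisely why these methods are preferred when day-to-day change in responses is itself a personality characteristic.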

The reliability coefficients of personality inventories tend to be
considerably lower than those for ability tests of the same length. One reason is
that many responses in the area of personal-social adjustment tend to be
specific to the situation. A person who might show extrovertive behavior 
on a job in which he had high competence might be introvertive in social 
situations. Even within the realm of social situations, there will be incon- 
sistencies as the person is in small or large groups, with persons of the same 
or opposite sex, or with his peers as compared with those in a position of 
authority. 

In interpreting personality profiles, it is especially important to check 
the test manual concerning the reliability of subtests and the intercorrela- 
tions between them. When one interprets such data with the aid of Table 
3.8, one may find that the difference scores have such low reliability that 
only the largest differences can be used as a basis for making inferences 
about intraindividual differences for a student. 
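
The reason difference scores are often so unreliable can be seen from the usual formula for the reliability of a difference between two scores expressed on the same standard-score scale. The sketch below assumes that formula (equal variances for the two subtests); the coefficients in the example are invented and are not taken from any test manual.

```python
def difference_reliability(r_aa, r_bb, r_ab):
    """Reliability of the difference between two equally variable subtest scores.

    r_aa, r_bb: reliabilities of the two subtests
    r_ab:       intercorrelation between the two subtests
    """
    return ((r_aa + r_bb) / 2 - r_ab) / (1 - r_ab)

# Hypothetical subtests with reliabilities .80 and .70 that correlate .60.
r_diff = difference_reliability(0.80, 0.70, 0.60)
```

With these figures the difference score has a reliability of only about .38, although each subtest taken alone is moderately reliable. The higher the intercorrelation between subtests, the less dependable the differences in a profile become.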


Interpreting Results from Personality Inventories 


In the typical personality inventory, relevance has been sacrificed in 
some degree in order to achieve objectivity of scoring. For example, more 
valid and significant information would undoubtedly be obtained if, instead 
of asking “Do you daydream frequently?” one asked a student about his 
daydreams or the circumstances that led him to have “spells of the blues.” 
Such questions, however, would not ordinarily be included in a personality
inventory, both because of the difficulty of obtaining frank replies and be- 
cause of the subjectivity that would be involved in scoring and interpretation. 

In a personality inventory, the extent of a student’s maladjustment is 
presumably indicated by the number of his unfavorable responses. The 
questionnaire does not reveal the intensity of feeling involved or the ap- 
propriateness of the behavior in terms of environmental factors. For exam- 
ple, if a student answers “Yes” in response to the question, “Do you get 
excited when things go wrong?" it would be necessary to obtain additional 
information (through observation or through interviews with the student 
and his parents) before appraising this response in terms of degree of 
maladjustment and type of assistance needed. Through the use of other 
techniques, one could find the answers to such questions as the following: 


How excited does he get? 

Is the degree of excitement unusual for a person of his age? 

Is his excitement disproportionate to the stimulus situation? 

Is his excitement so great that his behavior becomes disorganized? 

Is he overexcitable only when fatigued? 

About what things does he become excited? 

In what way does he manifest his excitement—in physical aggressiveness, in 
blaming others, in quiet tenseness, or in tearful self-pity?




In other words, personality questionnaires can aid in the first level of 
diagnosis—identifying those who need help. They can assist also in the 
second level of diagnosis—locating areas of difficulty—especially when re- 
sponses to individual items are studied. For the third level of diagnosis, 
however—identifying causative factors—it is especially important that the 
leads given by the personality inventories be used as a basis for further 
study of the student through the combined use of observation, student in- 
terview, parent conference, and other techniques presented in this and the 
succeeding chapter. 

Students' replies to individual items of a personality questionnaire may
provide leads for observation and for follow-up interviews. In fact,
scanning replies to individual questions may prove to be more valuable than
a routine analysis of test scores. If one suspects, for example, that a
student's school adjustment difficulties are rooted in his family relationships,
he may wish to examine the student's replies to those questions involving
parent-child relationships, such as:


Are you scolded for many little things that do not amount to much? 

Do you feel that you are bossed too much by your folks? 

Have things ever been so bad at home that you have had to run away? 

Do you wish that more affection were shown by more members of your 
family? 


Do your folks appear to doubt whether you will be successful?¹⁵


The study of student replies to such groups of related questions may
provide extremely helpful leads for further study. In his conferences with
parents, however, the counselor should avoid any reference to specific
statements made by a student or to the inventory as a source of
information. It is both unnecessary and unwise, for example, for a counselor to
reveal that his interest in "problems you may have with John at home"
grows out of John's low test score in "family relationships." Nor should
personality inventory scores be cited in discussing the student's problems
with his parents. Students' replies to items of a personality inventory should
be considered confidential information.

14 The three levels of diagnosis are explained in Chapter 14.

15 California Test of Personality, Secondary (Monterey, Calif.: California Test
Bureau, 1953).

Unless personality inventories have been developed according to the
methods recommended in our discussion of "construct validity" in Chapter 4,
we are not justified in interpreting the personality inventory scores
as representing a student's status with respect to underlying personality
dimensions or traits. We should, rather, view a student's personality
inventory as a summary of the symptoms and self-criticisms he was willing to
report on a specific questionnaire administered on a specific occasion. The
student's results on another questionnaire with the same trait label may be
quite dissimilar. As Cronbach emphasizes:


Trait names . . . are a source of serious confusion in the personality field.
The meaning of "introvert" . . . represents for one author a brooding neurotic,
for another anyone who would rather be a clerk than a carnival barker.
"Ascendance" ranges from spontaneous social responsiveness, in one theory, to
inconsiderate and overbearing behavior in another. . . . In the present Babel
of trait names, the only useful way to discuss personality test data is to speak
of "Guilford's Ascendance score," "CPI Dominance score," according to the
measure used.¹⁶


Although it is neither feasible nor desirable to discuss many of the
published inventories that are now available, a few of the widely used
inventories will be briefly discussed as illustrative of the major types of
inventories available.


INTERPRETATION OF THE CPT, A TRANSPARENT INVENTORY Illustrative
of a transparent inventory of personality "traits" is the California Person- 
ality Test (CPT), which has forms available for use at all grade levels 
and with adults. Although an attempt has been made to disguise questions, 
this inventory is transparent to the test-wise student. The fact that items 
of a type are grouped together for ease in scoring makes it evident that one 
is being asked about nervous symptoms, withdrawing tendencies, and the 
like. Hence, the use of the CPT should be restricted to situations in which 
we have every reason to believe that frank responses will be given. The 
reliability of difference scores should be noted in interpreting CPT profiles. 

Another problem that complicates the interpretation of transparent per- 
sonality inventories is that the well-adjusted person, especially the well- 
adjusted adult, is more likely to admit his faults than the person who is 
more insecure and has greater need to distort his self-appraisals. Loevinger17
has pointed out that psychologically mature college students and adults 
may get lower scores than persons whose self-concepts are at a less mature, 
overconforming stage of development. One should probably be suspicious 
of the consistently high personality profile on a transparent inventory. Even 


16 Lee J. Cronbach, Essentials of Psychological Testing (New York: Harper & 
Row, Publishers, Inc., 1960), pp. 467—468. 

17 Jane Loevinger, "A Theory of Test Response," Proceedings of the 1958 Invi-
tational Conference on Problems of Testing (Princeton, N. J.: Educational Testing 
Service, 1959). 


Personality Inventories and Projective Tests 309


though the examinee is telling the truth as he sees it, he may have a special
need to see himself as having all the characteristics expected of him by
society.


INTERPRETATION OF THE MMPI AND ITS ADAPTATIONS (EMPIRICALLY- 
KEYED INVENTORIES) Although the items of the MMPI inventory were 
originally selected and keyed in terms of their relationship to diagnostic 
categories (depression, hysteria, and the like), it is now recognized that the 
results are best interpreted in terms of cumulated research data on subjects 
with various coded profiles, that is, various combinations of high and low 
scores on the original scales. The interpretation of the MMPI requires
extensive psychological background, supervised experience in its use, and 
thorough familiarity with such aids as are listed in the chapter bibliography 
and with more recent research studies. Approximately one hundred new 
research studies on the MMPI appear each year. Even when the currently
approved methods of interpretation are used by highly qualified psychologists,
the scores should be used as a basis for hypotheses to be checked
by other methods, rather than conclusions. When used in this way, the
MMPI seems to have considerable value for clinical psychologists.

Two adaptations of the MMPI that are more suitable for use with normal
subjects are the California Psychological Inventory (CPI) and the Minnesota
Counseling Inventory (MCI). Both tests have verification scores.
Many research studies have been completed on the relationship between scores
on these inventories and such significant variables as underachievement,
delinquency proneness, and others. Bibliographies of such studies can be
obtained from the publishers.

On any personality inventory, one should study combinations of scores
on different subtests, rather than single subtest scores. For example, Gough,
author of the CPI, suggests that when a high score on Ai (achievement
through independence) is accompanied by a high score on Ac (achievement
through conformity), the person is likely to be efficient, well organized,
and stable; whereas a high score in Ai accompanied by a low score
in Ac tends to be found in people who are demanding and dominant.18


INTERPRETATION OF THE EPPS (A FORCED-CHOICE INVENTORY) DESIGNED
TO MEASURE CONSTRUCTS Another inventory on which considerable
research data are available is the Edwards Personal Preference
Schedule. This inventory differs from those previously discussed in at least 
two ways: (1) its design grows out of psychological theory (that is, the 


18 Harrison C. Gough, Manual, California Psychological Inventory (Palo Alto,
Calif.: Consulting Psychologists Press, 1957).




Murray19 theory of psychological needs); and (2) it utilizes the forced-choice
method, the alternative responses being matched with respect to
their rated social desirability. The first of these two characteristics implies 
that this inventory must be evaluated in terms of its construct validity as a 
measure of hypothesized dimensions. The second characteristic indicates 
that the inventory is designed to be fake-resistant; that is, the examinee is 
unable consistently to choose socially desirable responses. 

Although this type of inventory has many advantages, it has a few dis- 
advantages: (1) the forced-choice method shows only intraindividual dif- 
ferences, not the relative strength of a trait with respect to other examinees; 
(2) unless students are highly motivated in self-appraisal, they tend to 
resist this type of test because they must make difficult choices and they feel 
frustrated in their attempt to paint a desirable self-picture; and (3) this 
type of test tends to have low reliability coefficients for subtests, partly 
because of the fact that the student experiences such uncertainty in choos- 
ing that he is likely to change many choices on retesting. The first charac- 
teristic was considered in the discussion of the Kuder inventories in the 
preceding chapter. The second problem can be met, in part, by interesting 
the student in self-appraisal. In connection with the third, we have to face 
the fact that the higher reliability of many of the other inventories with 
which we might compare the EPPS is attributable in large measure to the 
consistency with which the examinee can communicate the impression he 
wishes to make on a transparent inventory in which his choices are not 
limited by the preference or forced-choice approach. 
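
The first disadvantage, that forced-choice scores show only intraindividual differences, can be illustrated with a small sketch. It is a simplified, hypothetical scoring scheme (one point for each time a need is chosen within a pair), not the scoring procedure of the EPPS itself; the need labels and responses are invented for illustration.

```python
from collections import Counter


def score_forced_choice(choices):
    """Count how often each need was chosen across forced-choice pairs.

    `choices` holds one entry per item pair: the need whose statement
    the examinee selected (hypothetical scoring, one point per choice).
    """
    return Counter(choices)


# Hypothetical responses of two students to the same six item pairs.
student_a = ["achievement", "order", "achievement",
             "affiliation", "achievement", "order"]
student_b = ["affiliation", "affiliation", "order",
             "affiliation", "achievement", "affiliation"]

scores_a = score_forced_choice(student_a)
scores_b = score_forced_choice(student_b)

# Every examinee's scores add up to the number of pairs answered.
total_a = sum(scores_a.values())
total_b = sum(scores_b.values())
print(scores_a, total_a)
print(scores_b, total_b)
```

Because each student's scores must sum to the same total, a high score on one need can be obtained only at the expense of the others. The profile therefore ranks needs within the person; it does not show whether one student's need is stronger than another student's in any absolute sense.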


INTERPRETATION OF PROBLEMS CHECKLISTS A quite different type of 
personality questionnaire is the problems checklist. The items on such a 
list are intended to be a representative sampling of problems in different 
areas. No claims are made that the checklist measures personality traits 
or that its use can entice the student into revealing any problems he does 
not choose to report. 

An illustration of a problems checklist summarized by areas of adjustment
is the Mooney Problem Check Lists,20 one form of which can be
used as early as the seventh grade. Student problems are classified under 
the following areas: Health and Physical Development, School, Home and 
Family, Boy-Girl Relations, Relations to People in General, and Self-cen- 
tered Concerns. 

Another example is the SRA Youth Inventory, developed for use in 


19 Henry A. Murray and others, Explorations in Personality (New York: Oxford 
University Press, 1938). 
20 Ross L. Mooney, Problem Check Lists (Columbus, O.: Ohio State University,
1943).




grades 7—12, but with separate profiles for the junior and senior high 
school levels. The leaflet on which the student draws his profile provides 
three and one-half pages of discussion designed to help him understand 
and use the inventory results. A description of the types of problems in
each area, suggestions on how to get help in solving problems, and a con- 
crete example based on one student's experience are included. The prob- 
lems reported by students are summarized under eight areas, as follows: 
My School, Looking Ahead, About Myself, Getting Along with Others, 
My Home and Family, Boy Meets Girl, Health, and Things in General. 
Item norms for different grade and sex groups are given in the manual. 
In addition, a Basic Difficulty Key is supplied for use by the counselor in 
indicating problems that may be caused by more serious personality difficulties.
The SRA Junior Inventory for grades 4-8 is also available.

The more recently published Billett-Starr Youth Problems Inventory
has separate forms for grades 7—9 and 10-12. Like the others, this inven- 
tory is designed for screening those students who need and wish individual 
counseling, as well as for identifying common problems that might be ap- 
proached through group guidance procedures. Totaling responses for area 
scores is not recommended.

Students who voluntarily admit on a problems checklist that they have 
problems they would like to discuss can be called in for counseling inter- 
views. Use of such a checklist helps the counseling staff to make effective 
use of the limited time available. One must recognize in interpreting prob- 
lem checklists, however, that only consciously felt problems that the stu- 
dent is willing to report will be checked. The results should be interpreted 
as suggesting areas worthy of exploration. Such checklists can provide a 
good starting point for individual or group guidance. Since there has been 
considerable parent criticism of widespread administration of personality 
inventories of any type, it is probably desirable that an advisory committee 
of representative parents assist in the planning of projects in which such 
checklists are routinely administered and in the interpretation of such
projects to other parents.


PROJECTIVE TECHNIQUES 


Clinical psychologists have tended to criticize the kinds of personality
descriptions provided by personality inventories. These critics prefer
approaches to personality study that consider the individual as a whole and
that try to explain why people behave as they do, not merely what they
are like. They contend that one cannot adequately describe complex human
personality by summating a series of trait scores. As Murphy says:


There are many different ways in which the simplest traits of the individual 
may be put together; some operate summatively, others subtractively. Some- 
times when one trait is present it acts almost like an enzyme, allowing the more 
effective utilization of another trait. 

Recently biologists . . . and other originators of "general system theory" 
have shown us that wherever living systems are involved, the problem of 
organization and emergence takes over and complicates the problem of showing 
how the individual trait reflects the system of which it is a part. It does not 
shock us today to be told that a patch of red in a landscape will look differently 
if the context is altered. But it still does seem to bother us . . . to be told that
a trait is operationally different when it appears in different contexts. Coolness, 
for example, or the maintenance of a low level of affect, is a very different 
thing in a danger situation and in a social gathering.21


Projective Techniques Used by Clinical Psychologists 
Anastasi has classified projective techniques into five major categories: 


1. Associative techniques, in which the individual responds to a stimulus by 
giving the first reaction that occurs to him. This approach, initiated by 
Galton and Jung, has been extensively used in screening subjects for psy- 
chiatric study. One of the most widely used lists of stimulus words is the 
Kent-Rosanoff Free Association Test. On this test, the psychologist can
check the subject's responses to each stimulus word to see the frequency 
with which that response was given in a standardization group of 1,000 
normal adults. 

One of the oldest and most widely studied of projective techniques, the
Rorschach Ink Blot Test, is classifiable in this first group as an associative 
technique. The test is administered individually. Although group forms are 
available, they are not considered to be comparable. In the administration 
of the Rorschach, the subject indicates orally what he thinks each ink blot 
might be, and the examiner records his responses. During an inquiry 
that follows the initial test, further responses are sought; and the psychologist 
"tests the limits" to see if the subject is capable of giving certain types of 
responses that he has not previously given. 

Psychologists differ widely in their appraisal of the value and limitations 
of the Rorschach. Although more than two thousand publications on the 
Rorschach are available, there are few well-designed studies concerning the 
validity of the assumptions involved in its interpretation. Several reviews of
the test are given in the Fifth Mental Measurements Yearbook, together 
with an extensive bibliography of research studies. 


The recently developed Holtzman Ink Blot Technique claims to provide 


21 Gardner Murphy, “Concepts of Personality—Then and Now,” 1956 Invitational 
Conference on Testing Problems (Princeton, N.J.: Educational Testing Service, 
1957), pp. 43-44.

22 Anne Anastasi, Psychological Testing (New York: The Macmillan Company, 
1961), p. 566.




the advantages of the Rorschach in clinical work and yet overcome some 
of its disadvantages. The number of ink blots is much larger; instead of 
10 ink blots, there are 90 (organized into two parallel forms of 45 each). 
Since only one response is obtained for each ink blot, the total number of 
responses is comparable from subject to subject. Computer analysis of the 
responses of hundreds of subjects has produced a scoring guide that con- 
tributes to high interscorer reliability. Percentile norms on 22 response 
variables have been prepared for eight groups, ranging in age from five- 
year-olds to adults, and including defined clinical groups. Although the 
authors have reported in detail concerning the development and standardization
of this test,23 years of evaluated experience in its use will be required
before its contribution to the appraisal of personality can be properly 
assessed. 

2. Construction procedures, which require the subject to create or construct a
product, such as a story. The tasks are usually introduced as a test of imagi- 
nation or creative ability, and interpretation of the results typically involves 
a content analysis of the story or other product. 


The best-known test of this type is the Thematic Apperception Test
(TAT), which has been extensively used in both personality research and
clinical work. The subject is asked to tell a story about each of several
ambiguous pictures, selected because of their relationship to characteristic
conflicts, personality needs, and environmental pressures. Although several
quantitative scoring plans have been developed and used in personality research,
clinical psychologists tend to interpret the content in terms of their
own perception of recurrent underlying themes and in terms of other information
they have about the subject.

3. Completion tasks, such as completing sentences or incomplete reaction
stories. The “incomplete sentence” technique has been used and studied for
years. In 1950, a selected list was published in a more objective test form
as the Rotter Incomplete Sentences Blank.24 Another widely used list was
prepared by Rohde in 1957.25

Another completion task, widely used in personality research and clinical
work, is the Rosenzweig Picture-Frustration Study, in which the subject
reacts to a series of cartoonlike drawings. In each drawing two characters 
are involved in a mildly frustrating situation of a type that frequently occurs 
in everyday life. The subject is asked to indicate what the frustrated person 
is probably saying. On the assumption that the subject identifies with the 
frustrated individuals, his responses are classified according to whether they
represent (1) “extrapunitive” responses (aggression directed outward), (2) 
"intropunitive" responses (aggression turned in upon the subject himself), 
or (3) "impunitive" (attempts at glossing over or evading the situation). 
Only moderate agreement among scorers on the classification of subjects'
verbal responses has been obtained.
4. Choice or ordering devices, calling for the rearrangement of stimuli, the 


23 Wayne H. Holtzman and others, Inkblot Perception and Personality (New
York: The Psychological Corporation, 1961).
24 J. B. Rotter and Janet E. Rafferty, Manual for the Rotter Incomplete Sentences
Blank (New York: The Psychological Corporation, 1950).
25 Amanda R. Rohde, The Sentence Completion Method (New York: The Ronald
Press, 1957).




recording of preferences, and the like. Since these tasks require fairly simple 
responses from the subject, the scoring can be entirely objective, although 
it may be time-consuming. One of the oldest and best-known tests of this 
type is the Szondi Test.26 Photographs of patients with various types of
mental illness are presented to the subject, who indicates which photographs 
he prefers. It is assumed that the subject will tend to choose photographs 
of patients with tendencies similar to his own. Validity studies have been 
disappointing. The Tomkins-Horn Picture Arrangement Test (PAT) re- 
quires that the subject react to each series of three pictures, arrange the 
three in the order “which makes the best sense” to him, and write a sentence 
for each of the three pictures so as to tell the story he has in mind. The 
test can be administered in groups and objectively scored. Temporal stability
of scores is low, and adequate validity studies have not been completed.
Like many personality tests, the PAT requires much more research
to aid in meaningful interpretation of scores.


5. Expressive methods, which differ from construction procedures in that the 
individual’s style or method is evaluated as well as his product. Almost 
every technique and type of subject-matter have been used. One of the best-known
examples is the Draw-a-Person Test by Machover.27 The subject is
asked to "draw a person"; while he draws, his sequence in drawing and time 
used are recorded, as well as his comments and questions. When he com- 
pletes the drawing, he is asked to draw a person of the opposite sex from 
the one he chose for his first picture. The results are interpreted qualitatively, 
a composite personality description being prepared by the examiner from
an analysis of its special characteristics. 

The use of puppets, dolls, and miniature objects in original dramatiza- 
tions would also be included under this heading. Techniques of play therapy 
have been adapted for use in projective testing, the examiner noting the 
objects the child selects, how he uses them, his verbalizations, and other 
behavior as he acts out his feelings. The Driscoll Play Kit, for example, 
illustrates the type of materials available to clinical psychologists. The kit 
opens to form an apartment inhabited by five plastic dolls with movable 
joints (designed to represent mother, father, brother, sister, and baby). 


Projective techniques tend to afford wide bandwidth and low fidelity. 
Hence, they are best used as exploratory techniques in the early phases of 
clinical study. They are best interpreted in combination with data from 
other sources that can serve to confirm or question hypotheses developed. 
They have the advantage of being highly interesting to most subjects and 
highly fake-resistant. Some of them can be used with children or adults 
who have difficulty in communicating verbally. Some involve nonverbal 
communication, which is not subject to as much restraint or censorship as 
verbal communication. Those that do depend on verbal communication, 
such as the picture-story tests, utilize materials and are administered in a 


26 Susan K. Deri, The Szondi Test (New York: Grune and Stratton, 1949). 
27 Karen Machover, Personality Projection (Springfield, Ill.: Charles C Thomas, 
1949). 




setting that encourages the subject to reveal fantasy material which he 
might ordinarily suppress. 

Separate volumes have been written about each of the various techniques 
listed above. Courses in projective techniques, which require prerequisite 
education in clinical psychology, must be supplemented by supervised ex- 
perience in the administering, scoring, and interpretation of these techniques. 


TEACHER USE OF ADAPTATIONS OF PROJECTIVE TECHNIQUES Of the 
techniques classifiable under projective techniques, teachers and counselors 
may be able to use to advantage: reaction stories, open questions, incom- 
plete sentences, and study of students’ creative writing and art products. 

In using open questions or incomplete sentences, teachers or counselors 
may use them individually with students on whom case studies are being 
made in order to understand their problems. Administered in privacy to a 
student with whom rapport has been established, they may provide leads
for further study. If open questions or incomplete sentences are to be administered
to a class, they should be of a more impersonal type (for example,
“My most difficult subject . . . ,” “My favorite recreation . . . ,”) or
should allow the individual considerable latitude in his choice of subjects
(for example, “When I grow up . . . ,” or “If I had three wishes, . . .”).
Because of the understandable reluctance of students to make negative
statements in a classroom situation, it may be advisable to phrase most
items positively, for example, “What I like about my home,” or “What
others have said they like about me.”

Teachers need to be especially careful in making inferences from students'
creative writing and their art products. Unless a theme utilized in a
student's story is repeated in the same or similar form in other writings, it
may represent only a plot from a television drama recently seen or a story
recently read.

Although psychologists are agreed that children express their emotions 
and conflicts through art, they are not ready to present a list of principles
that teachers can use with confidence in the interpretation of children's
art work. The art work of all children is not equally revealing. Some
children, especially in the middle and upper grades of elementary school, 
become so engrossed in the problem of representing objects and people 
that their work is no longer an emotional outlet for them.” 

Just as the teacher must not attach too much diagnostic significance 
to a single story, she must be similarly cautious about the interpretation of


28 Florence L. Goodenough and Dale B. Harris, “Studies in the Psychology of
Children’s Drawings, 1928-1949,” Psychological Bulletin, vol. 47 (September 1950),
pp. 369-433.
29 Ibid., p. 10.




isolated paintings. A series of paintings will reveal a child's characteristic
style, colors most frequently used, recurring content, maturity in repre- 
sentation, characteristic distortions or omissions in the human figure, 
and the like. 

Since expression of emotions through art is so highly individualized, 
it is obvious that norms for interpretation are difficult to develop. The 
significance of certain colors and techniques varies with the maturity of 
the child. For example, Alschuler and Hattwick found that consistent
emphasis on cold colors was not typical of the happy nursery-school
child, and that children of these ages who consistently favored cold colors
showed overcontrolled behavior. Among older children, however, preference
for the cooler colors was no longer indicative of poor adjustment.30


SUMMARY STATEMENT 


Personality inventories and projective techniques are alike in that their use 
by unqualified persons is hazardous; they are widely different in that they 
represent the two extremes of the psychometric vs. the clinical approach to the 
study of personality. Moreover, personality inventories are structured self- 
report questionnaires designed to minimize ambiguity; while projective tech- 
niques involve observing the examinees’ behavior in situations that are designed 
to be unstructured or ambiguous and in which examinees are encouraged to 
react in a highly individual manner. In general, psychologists who make con- 
siderable use of one of these approaches tend to distrust the other. 

As psychologists have attempted to develop inventories that would provide 
personality profiles for individuals, each author has tended to design and score 
his inventories on the basis of his own list of personality traits. More recently,
the techniques of factor analysis have been applied in an attempt to identify 
independent personality traits. The results of research studies by Cattell and 
Guilford were found to show considerable agreement. 

The chief type of personality test used in schools is the personality inventory 
or questionnaire, in which students answer a series of questions concerning 
their attitudes, feelings, and behavior. The validity of personality-inventory 
results depends not only on the care with which the questions are selected 
but also on the ability and willingness of students to give truthful responses. 
Attempts to check on the reliability of personality inventories and other ap- 
praisal techniques are complicated by the fact that the very inconsistency in 
a student's response may actually reflect an attribute of his personality. 

Some personality inventories are quite transparent; the examinee can easily 
discern the traits the test author is trying to measure. Other inventories have 
been carefully designed to disguise the dimensions being measured, and to 
thwart the efforts of the examinee who wishes to create a desirable impression. 
The relative merits and limitations of the transparent inventory are considered, 


30 Rose H. Alschuler and L. W. Hattwick, Painting and Personality, A Study of 
Young Children (Chicago: University of Chicago Press, 1947), p. 17. 




as well as the many techniques now being used in an attempt to increase the 
validity of inventory results. 

In the section on interpretation of personality inventories, separate consideration
was given to the interpretation of results from (1) a transparent
inventory, (2) an empirically keyed inventory, (3) a forced-choice inventory 
designed to measure constructs, and (4) problem checklists. Careful considera- 
tion of this material will help the reader to realize that the best inventory for 
one purpose will be quite unsatisfactory for another, and that each type has 
to be interpreted in terms of the type of inferences justified by the approach 
used in test construction, as well as the situation in which the test was ad- 
ministered (or the examinee's perception of that situation). 

Projective tests may encourage individuals to give their spontaneous reactions 
to ambiguous stimuli such as ink blots or to create individualized products such 
as stories about pictures involving conflict situations, which permit of a variety 
of interpretations. In some tests the individual is asked to complete unfinished 
sentences or stories; in others he indicates his preferences, for example, by 
arranging stimuli in a preferred order. Still other tests involve an evaluation of 
the person's style of response, as well as his product; for example, the examiner 
notes the sequence of movements, timing, comments and the like as the 
examinee draws a person or carries out an original dramatization. All of these
techniques provide data that can serve as the basis for hypotheses about the 
individual that must be checked with the data obtained from other sources. 
Courses in the use of projective techniques require prerequisite training in 
clinical psychology; formal instruction must be supplemented by supervised
experience in the administration, scoring, and interpretation of projective tests. 


SELECTED REFERENCES 


ALLEN, R. M., Personality Assessment Procedures: Psychometric, Projective, and
Other Approaches. New York: Harper & Row, Publishers, Inc., 1958.
CAMPBELL, DONALD T., “A Typology of Tests, Projective and Otherwise,”
Journal of Consulting Psychology, vol. 21 (June 1957), pp. 207-210.
CRONBACH, LEE J., “Response Sets and Test Validity,” Educational and Psychological
Measurement, vol. 6 (Winter 1946), pp. 475-494.
CRONBACH, LEE J., “Further Evidence on Response Sets and Test Design,” Educational
and Psychological Measurement, vol. 10 (Spring 1950), pp. 3-31.
DIAMOND, SOLOMON, “The Factorial Approach,” Personality and Temperament.
New York: Harper & Row, Publishers, Inc., 1957, pp. 151-183.
EDWARDS, ALLEN L., The Social Desirability Variable in Personality Research.
New York: Holt, Rinehart and Winston, Inc., 1957.
FRICKE, BENNO G., “Subtle and Obvious Test Items and Response Set,” Journal
of Consulting Psychology, vol. 21 (June 1957), pp. 250-252.
HANLEY, CHARLES, “Social Desirability and Response Bias in the MMPI,”
Journal of Consulting Psychology, vol. 25 [date and pages illegible].
HENRY, W. E., “Projective Techniques,” in Paul H. Mussen, ed., Handbook
of Research Methods in Child Development. New York: John Wiley and
Sons, 1960.
JACKSON, DOUGLAS N., AND SAMUEL MESSICK, “Content and Style in Personality
Assessment,” Psychological Bulletin, vol. 55 (July 1958), pp. 243-252.




LINDZEY, GARDNER, "On the Classification of Projective Techniques," Psycho- 
logical Bulletin, vol. 56 (March 1959), pp. 158-168.
MASLING, JOSEPH, "The Influence of Situational and Interpersonal Variables 
in Projective Testing," Psychological Bulletin, vol. 57 (January 1960), 
pp. 67-85.

MESSICK, SAMUEL, Measurement in Personality and Cognition. New York: John 
Wiley and Sons, Inc., 1962. 

SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness. New
York: Harper & Row, Publishers, Inc., 1962, Chapter 19. 


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Describe and evaluate one of the leading personality inventories used at 
the high school level. Study both the inventory and the manual, and consult 
the reviews in Buros’ Mental Measurements Yearbooks. 

2. Describe and evaluate a problems inventory for the high school level. 
Study the inventory and the manual, and consult the reviews in the Buros’ 
Mental Measurements Yearbooks. 

3. Compare the personality inventory and the problems inventory (mentioned 
in problems 1 and 2 above) with respect to method of construction, criteria 
used in validation, and purposes that they are designed to serve. 

4. Select three tests of personality for children of the same age range and 
compare them on the basis of content. Describe what each test purports to 
measure.

5. Under what circumstances can transparent personality inventories be used 
to advantage? In which types of situations should their use be avoided? 

6. Why should most projective techniques be used only as part of a case 
study by a qualified psychologist? 

7. What cautions should be observed in the interpretation of children’s art 
work? 


PART THREE
The Improvement of Instruction


10 Development, Try-out, and Revision of Teacher-Made Tests


As teachers have achieved a more significant role in planning the educa- 
tional experiences for their classes, they have also become responsible for 
appraising the extent to which students are progressing toward the goals 
of the educational program. Even in those schools where standardized 
achievement tests are regularly administered, they are usually given only 
once a year; and they measure only a fraction of the educational outcomes.
It is the teacher who is responsible for measuring student achievement
day by day and week by week. He must develop his own tests for
measuring student progress toward the immediate objectives of instruction. 


IMPORTANCE OF TEACHER-MADE TESTS 


It is through his own tests that the teacher communicates to students in- 
formation concerning the knowledges and the intellectual skills that he 
considers most important. Tests provide students with tangible indications 
of the outcomes expected from a course, even to a greater degree than do 
the textbook or syllabus.

Teacher-made tests and other types of teacher evaluation constitute the 
basis for grading students and reporting to parents. It is largely teacher- 
made tests that provide students with confirmation or “feedback” con- 
cerning the effectiveness of their efforts to learn. The knowledge that a test 
is to be given provides most students with strong stimulation to study; the 
types of test questions used in previous tests direct their efforts to learning 
activities that they believe to be most helpful in improving their test per- 
formance. Teacher-made tests have great potentialities for enriching or 
limiting the students’ self-directed study. If teacher-made tests faithfully
represent the major objectives of instruction, special studying or reviewing
for tests will reinforce other aspects of teaching.

In the process of planning teacher-made tests and devising items for 
them, the teacher faces up to such questions as: (1) What kinds of student 
behavior would be evidence of progress toward each objective? (2) What 
type of test situation and what specific items would elicit from students 
this type of behavior? (3) How should such behavior be rated or scored? 
(4) Is this objective realistic? If students have not shown progress, do I 
know how to reorient or modify instruction so as to obtain better results? 
When teachers try to devise test items that will help them in judging stu- 
dent progress toward a major educational goal, they begin to see more 
clearly what the goal really means and how difficult it is to determine 
whether students are really making progress toward ultimate objectives. 


CHARACTERISTICS OF A GOOD TEACHER-MADE TEST 


The characteristics of a satisfactory measuring instrument, as outlined in 
Part One, are just as applicable to teacher-made tests as to standardized 
tests. Specific application of these criteria, however, depends on the pur- 
pose for which a test is to be used. Some teacher-made tests are used to 
measure the relative status of students in some aspect of achievement (the 
scores serving as a basis for assigning marks or otherwise ranking students). 
Others are designed chiefly to serve certain instructional purposes— 
identifying facts and processes that require reteaching, helping students to 
recognize gaps in their own achievement, and the like. 


Tests Designed to Measure Relative Status 


If a teacher is developing a test to aid in ranking students with respect 
to achievement in geography or some other area, the test should: 


1. be based on a representative sampling of the content studied. The percentage 
of items on each topic should correspond approximately with the propor- 
tional emphasis given that topic in the course. In a geography test, for 
example, the teacher should consider the emphasis given in the course to 
each of the major geographic regions, as well as the relative stress placed 
on such topics as products, important cities, weather, soil conditions, trade 
routes, and the like. 

2. be based on a representative sampling of the abilities or skills emphasized 
in the course. Memorization of facts may be the only skill involved in a 
test unless the teacher makes a special effort to include questions that re- 
quire students to make comparisons, apply principles, read maps, and the 
like. 




3. contain a sufficient number of questions so that the test will have adequate 
reliability. Although no exact rule can be given regarding number of items, 
a test used as a basis for grading should probably have a reliability coefficient
of .70 or better.¹ Of course, a teacher can achieve the desired reliability
by combining scores on a series of short tests (that is, recording
and combining the raw scores, rather than assigning an A, B, or C grade 
to the student's scores on each short test). 

4. include items covering a wide range of difficulty. The test items should range 
from easy to difficult, with a large number of items geared to the middle 
group and with several items difficult enough to challenge the best student. 
(Several easy items are also needed, but they will be all too prevalent without
the teacher's trying to design them!)


Instructional Tests 


If the teacher is using his test for group diagnosis, student self-evaluation, 
or other instructional purposes, and will not use the scores as a basis for 
ranking students, he need not be so concerned with the criteria listed 
above. For example, he may wish to include a disproportionate number 
of questions on a specific topic or geographic region because of its com- 
plexity or because he thinks that reteaching of certain aspects may be 
needed. This disproportionate emphasis would constitute a violation of
criterion 1 and would distort the test scores as a basis for grading. The
teacher may wish to emphasize certain skills to the exclusion of others (a 
violation of criterion 2). He may develop a short test that would be 
unreliable as a basis for judging the achievement of individual students 
on a topic but which would serve to show what facts or concepts had been 
learned by the group as a whole. Or he may use a short test to help
students clarify the more important learnings from a motion picture or
excursion (see criterion 3). In a test on minimum essentials, the teacher 
would not be much concerned with range of difficulty of items; in fact, he 
might wish to include a large number of items that he hoped almost every- 
one had learned in order to assure himself that all students had mastered 
these basic concepts and to identify the few who had not (see criterion 4). 
That is, an instructional test that was not designed to rank students on the
basis of their total scores might violate one or more of the criteria listed
above and still be a valuable classroom test.

Superficially, it might seem that the instructional test does not need to
meet any standards. Every evaluation device, however, should meet the
basic criteria of validity, reliability, and usability in terms of the purpose


1 Although higher reliability is desirable, it is seldom achieved in teacher-made
tests. Increased reliability can be achieved by combining scores from several tests 
given during the semester or year. Table 3.5 and the quick method of estimating 
SD (Table 2.1) can be used to estimate reliability coefficients quickly. 




for which it is being used. For example, any test designed to help 
diagnose certain types of errors in arithmetic should contain several prob- 
lems of each type. Thus, student scores can be related to types of weak- 
nesses, rather than representing chance errors. All tests should be con- 
cerned with major outcomes rather than trivialities. All tests should be so 
skillfully constructed and scored that extraneous factors do not invalidate 
scores. Questions should not be ambiguous; the intent of each question 
should be clear to all students who are prepared for the test. Catch ques- 
tions, "textbook language," and stereotyped verbalizations should be 
avoided. 
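
The point about diagnostic tests, that several problems of each type let scores be related to types of weakness rather than to chance errors, can be sketched as a simple tally. The item labels, the minimum of three items per type, and the 50 percent error threshold are all hypothetical choices made for illustration, not rules taken from the text.

```python
from collections import defaultdict


def errors_by_type(item_types, responses):
    """Return {error_type: (number wrong, number of items)} for one student.

    item_types: the type of error each item is meant to detect
    responses:  booleans, True where the student answered correctly
    """
    tally = defaultdict(lambda: [0, 0])
    for kind, correct in zip(item_types, responses):
        tally[kind][1] += 1
        if not correct:
            tally[kind][0] += 1
    return {kind: tuple(counts) for kind, counts in tally.items()}


def likely_weaknesses(item_types, responses, min_items=3, min_wrong_share=0.5):
    """Flag only types tested by enough items and missed often enough."""
    flagged = []
    for kind, (wrong, total) in errors_by_type(item_types, responses).items():
        if total >= min_items and wrong / total >= min_wrong_share:
            flagged.append(kind)
    return sorted(flagged)


# Hypothetical subtraction test: four problems for each of three error types.
types = ["borrowing"] * 4 + ["zeros"] * 4 + ["carrying"] * 4
answers = [False, False, True, False,   # borrowing: 3 of 4 wrong
           True, True, False, True,     # zeros: 1 of 4 wrong
           True, True, True, True]      # carrying: none wrong
weak = likely_weaknesses(types, answers)  # ["borrowing"]
```

A single miss on the "zeros" problems is treated as a possible chance error, while three misses out of four on "borrowing" are treated as a pattern worth reteaching.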

Brownell has developed several criteria for judging the worth of class- 
room tests in relation to the instructional process: 


Does the test elicit from the pupils the desired types of mental processes? 


A test is good to the extent that it calls forth the mental processes which
should be measured. . . . It is not enough to be able to recall facts; the facts 
learned must function . . . we must deliberately set out first to teach the desired 
mental processes, and then measure the degree to which instruction has devel- 
oped these processes. 


Does the test encourage the development of desirable study habits? 

The type of test given determines the "set" adopted by the pupil in his study. 
If this set is made habitual by reason of the teacher's more or less exclusive use
of a particular type of test, the pupil builds his study habits accordingly and 
neglects other valuable methods of attack upon subject matter. 


Does the test lead to improved instructional practice? 


. . . just as the student adapts his study procedures to the type of test which
has been announced, so the teacher adapts her instruction to fit the type of
test she intends to give. 


Does the test foster wholesome relationships between teacher and pupils? 

No one who has observed children under test conditions can doubt the 
possible effects of such situations upon their mental health . . . tests, which 
are pleasurable experiences to young children, become events to be dreaded 
and avoided by older children. . . . If a test in any way impairs healthful, 
wholesome relationships and growth toward integrated personality, that test 
is bad.²


2 William A. Brownell, “Some Neglected Criteria for Evaluating Classroom Tests,” 
Appraising the Elementary School Program, 16th Yearbook, Department of Ele- 
mentary School Principals (Washington, D.C.: Copyright 1937, National Education 
Association) pp. 485-492. 




In these criteria, Brownell emphasizes the close relationship between 
evaluation and instruction. He recognizes that a teacher's chief concern, 
as he develops his own tests, should be that the evaluation advance the 
quality of instruction. If the teacher's tests stimulate his students to study 
relationships and apply principles, if they encourage the development of 
desirable study habits, and if they are accepted by students as both fair 
and helpful, testing can contribute immeasurably to the effectiveness of 
instruction. The teacher's discussion of test results when they are returned 
may determine whether testing facilitates or impedes the achievement of 
educational goals.³

All tests should avoid overemphasis on isolated facts, as opposed to
ideas and concepts that have more general application. Hawkes and others
suggest emphasizing questions involving “why, wherefore, how, with what
results, of what significance, explain, interpret, and compare,” as opposed
to “who, what, when, describe, and name.”⁴

Information is important; but facts, terms, and rules are best learned
when they are meaningfully interrelated to important concepts and principles
that have meaning to the student, which he can state in his own
words and apply to new situations. Certainly in daily quizzes or weekly
tests, the examination of students’ recall of specific facts and terms is
appropriate. Even here, however, items that test the student’s comprehension,
rather than just his memory, encourage a type of studying that
leads to longer retention and greater probability of transfer.

When major units of work are completed, the examination should certainly
require students to demonstrate their ability to apply concepts and
principles in test situations that are somewhat unfamiliar, that is, that do
not exactly parallel the textbook problems.


PLANNING TESTS FOR GREATER CONTENT VALIDITY 


Dressel has devoted many years to leadership in the field of relating evaluation
and instruction, showing teachers how carefully formulated objectives
can serve as bases for developing both learning experiences and evaluation
procedures. The following listing of parallel elements in these two processes
clarifies how closely they should be interrelated:


3 A helpful article on teacher discussion of achievement test results is
Lennon, “Testing: Bond or Barrier between Pupil and Teacher,” Test Service
Bulletin No. 82 (New York: Harcourt, Brace & World, Inc., n.d.).

4 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, eds., The Construction
and Use of Achievement Examinations: A Manual for Secondary School Teachers
(Boston: Houghton Mifflin Company, 1936), p. 111.


INSTRUCTION

1. Instruction is effective as it leads to desired changes in students.

2. New behavior patterns are best learned by students when the inadequacy
of present behavior is understood and the significance of the new behavior
patterns thereby made clear.

3. New behavior patterns can be more efficiently developed by teachers who
know the existing behavior patterns of individual students and the reasons
for them.

4. Learning is encouraged by problems and activities that require thought
and/or action by each individual student.

5. Activities that provide the basis for the teaching and learning of specified
behavior are also the most suitable activities for evoking and evaluating
the adequacy of that behavior.


EVALUATION

1. Evaluation is effective as it provides evidence of the extent of the
changes in students.

2. Evaluation is most conducive to learning when it provides for and
encourages self-evaluation.

3. Evaluation is conducive to good instruction when it reveals major types
of inadequate behavior and the contributory causes.

4. Evaluation is most significant in learning when it permits and encourages
the exercise of individual initiative.

5. Activities or exercises developed for the purposes of evaluating specified
behavior are also useful for the teaching and learning of that behavior.⁵


Ideally a teacher who is designing a comprehensive examination should 
define the universe to be sampled in much the same way as the author of 
a standardized test. That is, he should decide the proportional emphasis 
that should be given to different content areas so as to represent his instruc- 
tion fairly; and he should also decide the types of abilities to be sampled 
by the test items (for example, recognition or recall of learned material; 
comprehension as shown by some type of interpretation, use of learnings
in new situations, and the like). That is, his plan for the test should consider
the relative emphasis to be given both to content areas and to
processes or cognitive abilities (specific ways of responding to, or dealing
with, the course content). 


5 P. L. Dressel, “Evaluation as Instruction,” Proceedings of the 1953 Invitational
Conference on Testing Problems (Princeton, N. J.: Educational Testing Service,
1954).




An Illustrative Table of Specifications 


If the teacher accumulates test items without a plan, they will unduly 
represent informational learnings, especially knowledge of specific facts. 
Moreover, teachers are likely to overemphasize certain areas of content 
in which items are easily constructed. In order to improve the test's repre- 
sentativeness, or its content validity, one should first develop a blueprint 
for the test. 


Table 10.1
Specifications for a Final Examination in Natural Science, Term 3

                                               OBJECTIVES*
                                               Comprehension
                                               (Translation,
                                               Interpretation,
COURSE CONTENT                      Knowledge  Extrapolation)   Application  Analysis  Total

I.    The number concept                6            ?               ?           ?        ?
II.   Fundamental concepts of
      arithmetic                        4            ?               ?           ?        ?
III.  Quantitative descriptions         2            ?               ?           ?        ?
IV.   The gas laws                      3            ?               ?           ?        ?
V.    “The spring of the air”           1            ?               ?           ?        ?
VI.   The kinetic theory of matter      4            ?               ?           ?        ?
VII.  The theory of the atom            1            ?               ?           ?        ?
VIII. Electricity and combustion        1            ?               ?           ?        ?
IX.   Static electricity and magnets    2            ?               ?           ?        ?
X.    Electricity and the nature of
      matter                            1            ?               ?           ?        ?

      Total                            25            ?               ?           ?        ?

* Based on Benjamin S. Bloom, ed., Taxonomy of Educational Objectives, Handbook I:
Cognitive Domain (New York: David McKay Company, Inc., 1956).

Source: Used with the permission of the publisher from Paul L. Dressel and Associates,
Evaluation in Higher Education (Boston: Houghton Mifflin Company, 1961).




In Chapter 4, in our discussion of content validity, we presented an 
illustrative test blueprint. In Table 10.1, we present a blueprint, or table 
of specifications, for a science examination, in which the items are classi- 
fied in terms of both content and objectives. In the left-hand column are 
listed the ten content areas to be represented by test items. Across the 
top are listed four types of educational goals that constitute the first four 
major categories of the taxonomy of educational objectives. The teacher 
designing this final examination has decided that 25 percent of all test 
items should test progress toward knowledge goals; 20 percent, compre- 
hension; 30 percent, application; and 25 percent, analysis. No attempt was 
made to attain this proportion of items in each content category, but rather 
in the test as a whole. 
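
The objective weights just described (25, 20, 30, and 25 percent for the test as a whole) can be turned into item counts for a test of any length. The sketch below is illustrative only: the function names, the 60-item test length, and the largest-remainder rounding rule are assumptions introduced here, not part of the blueprint in Table 10.1.

```python
from collections import Counter


def items_per_objective(total_items, percent_by_objective):
    """Convert percentage weights into whole-number item counts.

    Fractional items are resolved by largest-remainder rounding so that
    the counts always add up to total_items.
    """
    if sum(percent_by_objective.values()) != 100:
        raise ValueError("weights must total 100 percent")
    exact = {k: total_items * p / 100 for k, p in percent_by_objective.items()}
    counts = {k: int(v) for k, v in exact.items()}
    leftover = total_items - sum(counts.values())
    # Give the remaining items to the objectives with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_fraction[:leftover]:
        counts[k] += 1
    return counts


def shortfalls(drafted_objectives, target_counts):
    """Report how many items are still missing for each objective."""
    have = Counter(drafted_objectives)
    return {k: n - have[k] for k, n in target_counts.items() if have[k] < n}


weights = {"knowledge": 25, "comprehension": 20, "application": 30, "analysis": 25}
plan = items_per_objective(60, weights)
# plan == {"knowledge": 15, "comprehension": 12, "application": 18, "analysis": 15}
```

Checking a draft against the plan with `shortfalls` makes visible the tendency noted earlier for accumulated items to overrepresent knowledge of specific facts.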


Content-Orientation vs. Goal-Orientation 


This two-way table of specifications gives adequate consideration to 
both the content and cognitive abilities of the course. It is important that 
test coverage be adequate from both points of view. 

The history of work on educational objectives has been an interesting 
one, and one that has had great implications for evaluation. In the first 
wave of enthusiasm for behaviorist psychology, just following World War 
I, objectives were stated in highly specific and utilitarian terms (the ability 
to multiply specific number combinations, to spell specific words, to recall 
the symbols for specific elements, and the like). Within a decade, this 
approach to educational objectives fell into disuse, partially because of 
the tremendous volume of specifics and partially because new studies in 
transfer of training had revealed that the learning of general principles 
and the development of generalized behaviors resulted in greater econ- 
omy of learning, and greater interest and retention, than the learning of 
isolated specifics. 

When the importance of learning for transfer was recognized, lists of 
highly generalized objectives were then developed by national associations 
in the various subject fields. The following list for social studies is an
example: 


1. Acquisition of important information 

2. Familiarity with technical vocabulary 

3. Familiarity with dependable sources of information on current social 
issues 

4. Immunity to malicious propaganda 

5. Facility in interpreting social science data 

6. Facility in applying significant facts and principles to social problems of
daily life


7. Skill in investigating social science problems 

8. Interest in reading about social problems and in discussing them 

9. Sensitivity to current social problems 

10. Interest in human welfare 

11. The habit of working cooperatively with others 

12. The habit of collecting and considering appropriate evidence before 
making important social decisions 


Since that time some educators have felt that teachers committed to major
objectives need only the freedom to plan instructional activities, keeping
these major objectives in mind; while others have realized that a tremendous
amount of study and evaluated experience is needed to discover
activities and materials that will help students achieve intermediate objectives
related to these ultimate objectives.

As instructional programs have become more diversified, published 
tests with a subject-matter orientation have come to fit fewer and fewer 
courses. Test publishers have shifted more to the measurement of developed
intellectual abilities and/or student progress toward the ultimate
goals of education. 

It is essential, however, that students be fairly graded on the knowledge
objectives of a specific course; that the adequacy of their learning of
specifics be measured so that correct learnings can be reinforced and gaps
in achievement be identified. Teacher-made tests (developed by one
teacher or by several teachers working cooperatively) must bear much of
the burden of measuring student learnings in knowledge. The specific facts
taught to clarify basic principles can vary to some extent from one locality
to another. It is appropriate, therefore, that teacher-made tests, rather
than standardized tests, bear much of the burden of testing students for
knowledge of specifics.

The controversy between content-oriented and goal-oriented teachers
regarding the importance of teaching and testing knowledges might approach
a constructive solution if both groups would recognize the fallacies
of either extreme point of view. Content-oriented teachers can become so
preoccupied with the importance of students’ learning specific facts and
terms that their students fail to gain a sense of direction and a level of
understanding, which comes from seeing the in[...]
principles and applying them to unfamiliar problems. On the other hand,
goal-oriented teachers who focus their attention on long-range objectives
may fail to give adequate attention to the specific learnings that help the
learner along the road to the ultimate goal. As Dressel says, they “sometimes
forget that a compass does not relieve the traveler from the necessity
of choosing routes and means of transportation.”⁷

The ultimate objectives of a course are usually stated in such generalized
terms that a teacher must think through the specific learnings that contribute
to their achievement before he has an adequate basis for planning
his tests or other evaluation instruments.


6 The Social Studies Curriculum, Yearbook, Department of Superintendence
(Washington, D.C.: National Education Association, 1936), pp. 320-340.


Table 10.2
Relative Advantages of Essay and Other Supply-type Test Items

ADVANTAGES

1. Easily prepared
   a. Fewer questions to prepare
   b. Need not be mimeographed
2. Largely eliminates guessing.
3. Stimulates use of superior study methods in preparation (as compared
   with study methods used in preparing for objective tests).
4.* Represents the most direct method of testing many outcomes. For some
   objectives, test exercises can closely approximate criterion behavior.
5.* Provides more adequate basis for making inferences about student's
   level of competency, for example, his ability to define words, write
   clear explanations of procedures, or complete a geometric proof.
6.* May give student opportunity to demonstrate his ability to:
   a. Choose most pertinent and important learnings
   b. Organize his knowledge
   c. Express opinions and attitudes
   d. Show initiative and originality
7.* May be useful for diagnosing incorrect interpretations and partially
   understood concepts.

DISADVANTAGES

1. May have relatively low reliability, owing to
   a. Limited sampling of learnings
   b. Subjectivity of scoring
2. May have relatively low validity, owing to
   a. Limited sampling of learnings
   b. Low reliability
3. Requires excessive time of students in writing
4.* Tends to be quickly and carelessly constructed, with the result that:
   a. Questions are ambiguously stated
   b. Questions are so general that a student can bluff or “talk around”
      the subject
   c. Questions are of unequal difficulty, with no provision for weighting
      them unevenly in the scoring process
   d. Selection of questions is not representative of major learnings
5.* Tends to be graded without an adequate scoring key, so that students'
   marks are affected by
   a. The “halo effect” (good or bad) of a student's previous level of
      performance or success on a single question of the test
   b. Legibility of handwriting
   c. Errors of spelling and grammar
   d. Effectiveness of written expression

* Variable factors, which are markedly affected by the teacher's skill and care in test
construction and scoring.


Table 10.3
Relative Advantages and Disadvantages of Objective
or Selection-type Items

ADVANTAGES

1. Makes possible extensive sampling of learnings in a relatively short
   testing time.
2. Tends to have high reliability as the result of
   a. Objectivity of scoring
   b. Extensiveness of sampling
3. Is scored objectively, with the result that:
   a. Scoring time is reduced
   b. The teacher is freed from suspicion of partiality
   c. Students’ scores are not affected by such extraneous factors as
      ability to write rapidly and the like
   d. Students may do self-scoring
   e. Scoring may be delegated to clerical workers or readers
   f. Statistical analysis of student performance (on the test and on
      specific items) is facilitated
   g. At the upper grade levels, tests can be scored by machine and
      summaries made of student success on each test item
4. Focuses student attention on specific facts or abilities being tested,
   permitting no evasion.
5. May be valuable for such instructional purposes as:
   a. Pretesting
   b. Diagnostic testing
   c. Individualized self-testing
   d. Testing for application of principles to new situations

DISADVANTAGES

1. Requires time and skill for adequate preparation.
   (Note: Good test questions, however, can be reused. Moreover, time
   spent by the teacher in test preparation may result in increased
   awareness of the goals of instruction, major concepts to be developed,
   facts related to such concepts, and the like).
2. Is of limited value in some subjects.
3. May stimulate superficial learning of many details because of
   a. Emphasis placed on recognition of correct answer rather than on
      remembering
   b. Failure to require the student to organize significant facts and
      ideas and to reason about them
4.* May include ambiguous or misleading questions.
5.* May result in unduly high scores for intelligent, test-wise students,
   who have not studied, because of:
   a. Transparent clues in grammar, word form, or phrasing
   b. Insufficient care in the devising of incorrect responses, allowing
      correct answers to be chosen more by the avoidance of obviously
      poor choices than by recognition of the correctness of the right
      response

* Variable factors, which are markedly affected by the teacher's skill and care in test
construction and scoring.


THE ADVANTAGES AND DISADVANTAGES OF 
ESSAY AND OBJECTIVE TESTS 


The development of objective tests in the period following World War I
was stimulated by research studies that revealed the very low reliability 
of student scores on the traditional essay test. For a period of time fol- 
lowing these discoveries, the essay test was almost universally in disrepute, 
and the new objective test was almost as consistently praised. Today educators
recognize the strengths and limitations of each approach. Moreover,
there is a growing recognition that many of the criticisms of both
approaches are not necessarily inherent but grow out of ineffectiveness 
in their application. 

The advantages and disadvantages of essay and objective tests are sum- 
marized in Tables 10.2 and 10.3. Those advantages and disadvantages 
that tend to characterize one approach or the other are listed first, followed 
by other factors that are markedly affected by the teacher’s skill and 
care in test construction and scoring. Many of the claims made for the 
essay examination, for example, are not realized in practice; whereas many 
of the criticisms of the essay examination can be minimized by care in 
construction and scoring. Similarly, the objective examination has potentialities
that may or may not be realized, depending upon the skill used
in test construction. 

Ideally, a teacher uses a combination of the two approaches and is 
constantly trying to improve his effectiveness in each. In fact, teachers
are increasingly combining objective and essay questions in a single test 
in order to obtain both the advantages of the former in terms of more 
extensive sampling, higher reliability, and objective scoring and the advantages
of the latter in stimulating superior study methods and giving
the student opportunity to organize his knowledge and express his own
opinions and attitudes.


7 Paul L. Dressel, “Measurement and Evaluation of Instructional Objectives,” 17th 
Yearbook, National Council on Measurements Used in Education. (New York: The 
Council, 1961), p. 4. 




THE CONSTRUCTION OF TEST ITEMS 


If he is to develop good tests, the teacher must develop skill in the construction
of the major types of test items. Research studies have revealed
no order of merit in the various types of test items. Proficiency in writing 
all types will enable the teacher to select the best type of test question for 
each of the various outcomes to be tested. 


Supply-Type Items: Essay Questions 


Teachers like essay questions because of their ease of preparation, but 
dislike them because of the time required for scoring and the difficulties 
of explaining grading to students. The teacher who is willing to spend 
time in the careful formulation of essay questions so that they focus clearly 
on the basic principle or concept to be tested will achieve some of the
special values claimed for the essay examination and will also save time
in scoring the test and interpreting the results.

Remmers and Gage offer the following suggestions for improving essay 
examinations: 


1. Use essay questions to evaluate achievement of only those instructional
objectives not as well or better tested by the short-answer forms.

2. Phrase the questions so as to require as precisely as possible the specific
mental processes operating on specific subject matter. . . .

3. Phrase the questions so as to give as many hints concerning the
organization of the pupil's answers as are not inconsistent with the instructional
objective at which the questions are aimed. . . . the more specific the
essay question becomes, the more similar it becomes to short-answer test items.
Carried to an extreme, this technique would rob the essay question of its unique
value in testing the pupil's ability to organize and express his . . . . We
can attempt to elicit as much organizational effort from the pupils as possible
while giving them a common set of reference points so that their answers will
be comparable.

4. Permit no choice among questions. Only by requiring all pupils to answer
all questions can their achievement be compared. . . . The teacher who permits
pupils to choose among optional questions can never know whether all of them
have taken a test of equal difficulty. . . .

5. Balance the questions in difficulty so that the pupil can actually write
adequate answers to all of them within the allotted time if he possesses the
required achievement.⁸


8 H. H. Remmers and N. L. Gage, Educational Measurement and Evaluation
(New York: Harper & Row, Publishers, Inc., 1955), pp. 183-184.




Since essay tests require considerable time to score, we should con- 
struct questions that will not require students to provide a great deal of 
background information which almost all of them know. If an essay test 
item is to justify the scoring time, it should be so constructed that students 
devote most of their response time to those aspects of the question that 
will differentiate most effectively among them with respect to their achieve- 
ment of some significant outcome of instruction. The students in a measure- 
ment class, for example, may all know the headings of the taxonomy if 
this information has been stressed by the instructor. A question that would 
require them to use a mimeographed list of headings of the taxonomy to 
classify test items and justify their choices would differentiate more effec- 
tively among them, and on a more significant basis, than a question merely 
requiring memory of its major headings and subheadings. 

Before grading students’ responses, the teacher should actually take the 
essay examination himself, listing the points that he expects students to 
make in response to each question. Other acceptable points made by 
students can be added to the key as scoring progresses. If the question 
cannot be analyzed into parts but must be rated as a whole, a sorting 
process is recommended. In using this method the teacher might sort 
students’ responses into piles representing the five letter grades and
intermediate degrees of merit, for example, “A,” “between A and B,” “B,”
and the like. The teacher should later reread those papers that are grouped 
as "between A and B," and all similar classifications, as a basis for 
reassigning them to one grade or the other. 

The teacher should make every effort to avoid the “halo effect” in
scoring, that is, the effect on a test grade of the teacher's attitude toward
a student (because of the student's behavior or his general level of past
performance). Such procedures as the following are helpful:


1. Keeping the identity of students secret during grading. 

2. Correcting question 1 on all papers, then question 2 on all papers, and the
like.

3. Occasionally reshuffling the papers so that a student's paper may not be
graded unduly low or high because of its position after an unusually good
or poor paper.
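
The three procedures above can be combined into a single grading schedule. The following is only a minimal sketch in Python, not a method given in the text: the function name `grading_order`, the code format `P001`, and the optional `seed` parameter are all assumptions made for illustration.

```python
import random

def grading_order(paper_ids, num_questions, seed=None):
    """Return (order, codes) for a question-by-question grading session.

    order is a list of (question_number, anonymous_code) pairs;
    codes maps each real paper id to its anonymous code.
    """
    rng = random.Random(seed)
    # Procedure 1: hide identities behind codes assigned in random order.
    shuffled = rng.sample(list(paper_ids), len(paper_ids))
    codes = {pid: f"P{i:03d}" for i, pid in enumerate(shuffled, start=1)}
    order = []
    # Procedure 2: finish one question on all papers before starting the next.
    for question in range(1, num_questions + 1):
        pile = list(codes.values())
        # Procedure 3: reshuffle so no paper always follows the same neighbor.
        rng.shuffle(pile)
        order.extend((question, code) for code in pile)
    return order, codes
```

The teacher would keep the `codes` mapping sealed until all questions have been graded, and only then translate codes back into names.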


Since one of the purposes of essay questions is to encourage originality,
the teacher should accept and encourage original answers that are based 
on facts and show good thinking. 


Supply-Type Items: Recall or Completion Questions 


The most widely used recall items are of two types: a direct question
that can be answered by a single word or a short phrase, such as “Who


Teacher-Made Tests 335 


invented the steamboat?” and a simple sentence presented in incomplete
form, such as “The steamboat was invented by ______.”

The recall question shares with the essay question the advantages of 
ease of construction and the fact that the student is asked to recall informa- 
tion, rather than merely to recognize the correct response. If the teacher 
words his questions so as to avoid ambiguity and constructs a scoring key
that includes all acceptable answers, the scoring can be highly objective.
Because of its very nature, however, the use of the recall question tends
to be limited to the testing of descriptive information and associations of
the who, what, when, and where type.

The following suggestions may assist the teacher in improving the 
quality of his recall items: 


1. In general, use recall items only when the correct response is a single word
or brief phrase.

2. In recall questions of the completion type, it is best to omit only one key
word or phrase; the omitted word or phrase should preferably be at the end
of the sentence.

3. Avoid indefinite statements. Be sure that the kind of response wanted is
clearly indicated; for example, “The steamboat was invented in the year
______” rather than “The steamboat was invented in ______.” (The locality
instead of the date might conceivably be given in response to the second
question.)

4. Make minimal use of stereotyped phrases or other textbook language;
avoid placing a premium on the student's recalling a unique word or phrase
when other responses would indicate understanding of the concept.

5. Avoid having the grammatical structure of the question or statement give a
clue to the response. Use of the article “a” or “an,” for example, will provide
students with a clue.

6. Do not give clues to the answer by varying the number or length of the
blanks. For example, use “Florida was explored by ______” rather than
“Florida was explored by ______ __ ______.”


When the teacher wants to check on the student's recall of specific terms
and facts that are basic to further work, the recognition-type of question
may be quite inadequate. Many teachers either use recognition items as
admittedly inadequate substitutes or spend hours scoring recall questions.
By using the ingenious procedure illustrated in Figure 10.1, the teacher
can require knowledge at the recall level for significant terms and facts.
Such answer sheets can be easily and objectively scored by a lay-over
scoring stencil; or standard answer sheets providing room for as many as
[...] choices can be scored by machine (each student being supplied with
a “marker” indicating the way in which the letters of the alphabet have
been allocated to the answer spaces).




Directions: After you have answered all completion questions in the usual manner, 
carefully indicate your answers on this sheet by blackening in for each item the 
answer space corresponding to the third letter of your response. If the name of a 
person has been requested, use his last name only; ignore an apostrophe, 
hyphen, or space; that is, the third letter in O'Brien is r. 


EXAMPLES

A. For what term is the formula [formula not legible in the source] used?

   AB  CD  EF  GH  IJK  LM  N  OPQ  R  S  TU  VW  XYZ

Since the answer is standard deviation, the first answer space is
marked.

B. What term is used for the expression under the radical sign?

   AB  CD  EF  GH  IJK  LM  N  OPQ  R  S  TU  VW  XYZ

Since the answer is variance, the 9th answer space is marked.

Do not guess; a penalty for guessing is used in the scoring process. Re-
member that you are indicating the third letter.

[Answer spaces for items 1 through 10, each row headed
AB  CD  EF  GH  IJK  LM  N  OPQ  R  S  TU  VW  XYZ]

Fig. 10.1 Answer Sheet Used To Facilitate Scoring of
Completion Questions.


This approach was devised by C. F. Willey, “Fully Objective Scoring of the
Completion Test,” a paper presented at the 1963 Convention of the American
Educational Research Association. Reported, in part, in C. F. Willey, “Objective
Scoring of the Completion Test,” Psychological Reports, vol. 10 (April 1962),
pp. 501-502. The first letter is not used since the student might have a hazy recol-
lection of a term and be able to recall the first letter. The second letter is so
frequently a vowel that the third letter seems best.
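
The third-letter rule in Fig. 10.1 can be expressed as a small lookup. This is a sketch only: the thirteen letter groups are read off the column headings of the figure as they survive in this copy, and the function name `answer_space` and the `is_person` flag are hypothetical.

```python
# Letter groups as read from the answer-sheet headings in Fig. 10.1 (assumed).
GROUPS = ["AB", "CD", "EF", "GH", "IJK", "LM", "N",
          "OPQ", "R", "S", "TU", "VW", "XYZ"]

def answer_space(response, is_person=False):
    """Return the 1-based answer space for the third letter of a response."""
    # For a person's name, the directions say to use the last name only.
    text = response.split()[-1] if is_person else response
    # Ignore apostrophes, hyphens, and spaces, as the directions require.
    letters = [ch.upper() for ch in text if ch.isalpha()]
    if len(letters) < 3:
        raise ValueError("response has fewer than three letters")
    third = letters[2]
    for position, group in enumerate(GROUPS, start=1):
        if third in group:
            return position
    raise ValueError(f"no answer space for letter {third!r}")
```

With these groups, "standard deviation" falls in the first space and "variance" in the ninth, which agrees with the two worked examples on the answer sheet.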


Selection-type Items 


Before we discuss true-false, multiple-choice, and matching items, it is
well to list some general suggestions for item-writing that apply to all
selection-type items.




1. All test items should be clearly and simply worded and should be gram- 
matically correct. 

2. Stereotyped expressions and textbook language should be avoided. State- 
ments should not be “lifted” from the textbook. 

3. Questions should be edited to reduce ambiguity of wording; there should be 
only one way in which a statement or term could be interpreted by students. 

4. One should avoid providing clues to the right answer, for example, having
the right answer longer or more cautiously phrased or using such “specific
determiners” as are listed in the sections on true-false and other recognition-
type questions.
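
A teacher editing a draft test could screen statements for specific determiners mechanically before reading them closely. The sketch below is illustrative only; the two word lists are the examples this chapter gives in its true-false section, and the function name `find_specific_determiners` is an assumption.

```python
import re

# Example words from the chapter's true-false section; not an exhaustive list.
LIKELY_FALSE = {"all", "none", "always", "never"}
LIKELY_TRUE = {"some", "generally", "may", "should"}

def find_specific_determiners(statement):
    """Report words in a statement that may give away the answer."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return {
        "suggests_false": sorted(words & LIKELY_FALSE),
        "suggests_true": sorted(words & LIKELY_TRUE),
    }
```

A flagged statement is not necessarily a poor item; the output only marks items that deserve a second look.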


TRUE-FALSE ITEMS Of the various types of questions found in teacher- 
made tests, the true-false item is undoubtedly the most widely used and 
the most severely criticized. True-false items are popular with teachers 
because they seem relatively easy to construct. A large number of true-
false items can be typed on a single page and can be answered by students
in a few minutes of class time. Scoring is rapid and easy. Unless they are
carefully constructed, however, true-false questions are likely to be either
obvious or ambiguous. Moreover, since a student has a 50-50 chance
of answering any question correctly by guessing, true-false questions are
of little help in diagnosis. Many gaps in the student's knowledge may be
concealed by successful guesses.
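
The arithmetic behind the 50-50 chance can be made concrete. The sketch below computes the score expected from blind guessing and applies the conventional correction-for-guessing formula, rights minus wrongs divided by one less than the number of choices. That formula is a standard one assumed here for illustration; it is not stated in this section, and the function names are hypothetical.

```python
def expected_chance_score(num_items, num_choices=2):
    """Score expected if every item is answered by blind guessing."""
    return num_items / num_choices

def corrected_score(num_right, num_wrong, num_choices=2):
    """Conventional correction for guessing: R - W / (k - 1).

    Omitted items are counted neither as right nor as wrong.
    """
    return num_right - num_wrong / (num_choices - 1)
```

On a 50-item true-false test, a student who guesses throughout can expect about 25 right, and a paper with 25 right and 25 wrong receives a corrected score of zero.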

If they are carefully constructed, however, true-false questions have a
definite and important place in teacher-made tests. They make it possible
to test a relatively large sampling of learnings per unit of student time.
Moreover, true-false items are very well adapted to testing (1) under-
standing of principles or generalizations; (2) persistence of popular mis-
conceptions; and (3) situations in which there are only two logical re-
sponses (north, south; right, left; colder, warmer; larger, smaller; and
the like). Examples of each of these three types are given below:


Type 1. T F The best way to keep farm soil in good condition is to plant the same crops
each year.
Type 2. T F Swallowing grape or watermelon seeds causes appendicitis.
Type 3. T F Most of the people of the world live in the Northern Hemisphere.


True-false questions can be used to advantage in instructional tests,
especially if the administration of such test questions is followed by class
discussion. Since many generalizations are neither wholly true nor wholly
false, the symbol S can be added for “sometimes true and sometimes false.”


T F S 1. Two right isosceles triangles are congruent if a leg of one equals a leg of
the other.
T F S 2. Two isosceles triangles are similar if any angle of one equals the corre-
sponding angle of the other.
T F S 3. An equilateral quadrilateral is a square.9


9 Hawkes, Lindquist, and Mann, op. cit., pp. 372-313.




In science, industrial arts, and other subjects, true-false questions can
be used to test the student's ability to apply principles to new situations.
For example, the following questions are based on a diagram showing
electrical circuits and switches:


Yes No 1. If switch D is closed, will it cause a short circuit? 

Yes No 2. If switch S is open, will there be a flow of current? 

Yes No 3. If switches S and A were closed, would this create a short circuit? 
Yes No 4. Can light B be controlled by switch D? 

Yes No 5. Does the current in this circuit flow from X to Y?10


In instructional tests, many teachers prefer an adaptation of the true- 
false question that requires the student to indicate why a statement is false 
or to revise the statement so as to make it true.


Directions. Some of the following statements are true and some are false. If the state-
ment is true, encircle the “T” at the left and do no more. If the statement is false,
encircle the “F” and do two more things:


1. In blank “A” insert the word that makes the statement false.
2. In blank “B” insert the word that would make it true.


DO NOT USE WORDS THAT ARE UNDERLINED. The first item is answered as an example. 


(X) T (F) Large city newspapers are printed on cylinder presses.
    A. (cylinder)
    B. (rotary)

(1) T F The optical center of a page lies just below the true center.
    A. ______
    B. ______

(2) T F Fir plywood is sold by the board foot.11
    A. ______
    B. ______


In order to retain the values inherent in this type of question and still
have the advantages of objective scoring, the following variation of the
true-false question can be used:


Directions. Some of the following statements are true and some are false. If the state-
ment is true, encircle the “T” preceding the item and do no more. If the statement is
false, encircle the “F” and do two things:


1. Underline the word that makes the statement false.
2. From the list just below, select the word that would make the item true and place
the letter preceding that word in the blank space before the item.


10 From William J. Micheels and M. Ray Karnes, Measuring Educational Achieve-
ment. Copyright 1950. McGraw-Hill Book Company, Inc., p. 208. Used by permis-
sion.


11 From William J. Micheels and M. Ray Karnes, Measuring Educational Achieve-
ment. Copyright 1950. McGraw-Hill Book Company, Inc., p. 203. Used by permis-
sion.


The first item is answered as an example.


A. 10            G. attract     M. parallel
B. 15            H. cell        N. repel
C. 25            I. copper      O. series
D. alternating   J. current     P. steel
E. aluminum      K. direct      Q. voltage
F. ampere        L. ohm         R. watt

(X) T (F) ______ The volt is the unit of resistance.
(1) T F ______ If a DC circuit has a pressure of 20 volts and a current of 5
    amperes, the resistance would be 4 ohms.
(2) T F ______ The elements of a telegraph circuit are connected in parallel.12


The following suggestions may assist the teacher in constructing better
true-false items:


1. The questions should be related to significant facts or generalizations. The
best true-false statements require the student to understand a significant fact
or generalization presented in a new way.

2. The crucial element in the statement should be readily apparent to the stu-
dent. Ordinarily it should be placed in the main clause and near the end
of the statement. Underlining the crucial word or words may be desirable.

3. Avoid “lifting” true statements directly from the textbook or developing
false statements by the mere insertion of the word “not” into such a lifted
statement. Not only does such a procedure encourage rote learning, but
some textbook statements are ambiguous when removed from their context.

4. Avoid the use of specific determiners (words or phrases that are usually
associated with either a true or a false statement). Such words as “all,”
“none,” “always,” “never,” and the like are usually associated with false
statements; whereas statements containing “some,” “generally,” “may,”
“should,” and the like are usually true.

5. Avoid making true statements consistently longer than false ones.

6. Have a somewhat larger number of false than true statements. This sugges-
tion is made because the student who does not know the answer is more
likely to guess “True” than “False.”

7. Avoid statements that are partly true and partly false.

8. Speed up scoring by typing the symbols T and F in a column (preceding
or following the questions) so that students can mark their choices, and a
lay-over scoring stencil (with holes in the positions for correct responses)
can be used. If no answer column has been typed on the test, or if the test
questions are dictated, have students write the symbols + and 0. These are
more easily distinguished in scoring than T and F or + and -; they are also
less easily changed by students when self-scoring is used.


12 From William J. Micheels and M. Ray Karnes, Measuring Educational Achieve-
ment. Copyright 1950. McGraw-Hill Book Company, Inc., p. 208. Used by permis-
sion.


MULTIPLE-CHOICE ITEMS In a multiple-choice item, either an incom-
plete statement is followed by several possible completions; or a direct
question is followed by several possible answers. Typically, a multiple-
choice question is designed so that only one answer is correct. However,
variations have been developed in which (1) the student selects the best
answer from a number of responses that vary in their acceptability; (2)
the student selects one incorrect or otherwise inappropriate response from
a group of three or more; or (3) the student checks two or more correct
responses from a list of several alternatives.

Multiple-choice items can be designed to require reasoning and judg- 
ment as well as a knowledge of facts. Multiple-choice items that are well 
constructed tend to be much more valid and reliable than an equivalent 
number of true-false items. They are applicable to evaluating growth 
toward a wide variety of instructional goals. They are usually well liked 
by students. They can be easily and objectively scored. 

On the debit side should be mentioned the fact that good multiple- 
choice items are difficult to construct. They require more student time in 
responding and more space on the page than do true-false or recall items. 
That is, the greater reliability of multiple-choice items is counterbalanced 
in part by the smaller number of items (and therefore the smaller sampling 
of learnings) that can be tested in the period of time required by a much 
larger number of true-false questions. 

Because of their greater difficulty of construction, multiple-choice items 
should probably not be used when a simple recall item is adequate—that 
is, when there is clearly only one correct response and that response is a 
single word, number, or brief phrase; or when there are only two plausible 
responses (for example, right or left, North Pole or South Pole, safe or un- 
safe, and the like). In the latter case, a true-false item is usually effective. 

The value of a multiple-choice item depends largely on the skill with 
which the incorrect choices, or distractors, are written. For some multiple- 
choice items, five plausible alternatives can be constructed; for others, 
the teacher may have only three plausible choices. In the latter case, the 
inclusion of two additional choices that are obviously false adds nothing 
to the measurement value of the question. 

Although the number of available choices should not vary at random
throughout a test, there is no reason why a teacher cannot use the multiple-
choice technique for a group of questions where there are as few as three
responses. For example, in a science test, a situation might be described
and a series of conditions listed, for each of which the student would check
one of three alternatives: “increases growth,” “decreases growth,” or
“no change.”13

The following suggestions may be helpful in improving the quality of
multiple-choice items.


1. As much of the item content as possible should be put in the stem of the
item.14 If this is done, the informed student will have the answer in mind
before he scans the options given. Moreover, space and student time are
conserved because repetition of words in the various options is avoided.

2. The inexperienced item writer may find it advisable to use the direct ques-
tion rather than incomplete sentence form, since the question form forces
him to state the problem clearly and also reduces the risk of giving the
student clues through grammatical inconsistencies. However, the more ex-
perienced item writer will prefer the incomplete sentence, because careful
phrasing of the stem may reduce the length of the options.

3. Make all responses plausible. It should be necessary for the student to read
and consider all choices presented. The alternative choices should deal with
the same family of ideas; that is, they should be reasonably homogeneous with
respect to period of history, geographic area, or other basis of classification.
For example, in the question

Which of the following men invented the telephone: (a) Edison; (b) Bell; (c) Marconi;
(d) Morse?

all four choices are inventors in the field of communications in the same
period of history; hence, the question is a better one than if the responses
were less homogeneous.

4. The correct answer should not be consistently longer than the incorrect ones.

5. Avoid giving clues to the unprepared student through grammatical construc-
tion or other means. The incomplete sentence is especially likely to include
grammatical clues: use of “a” or “an,” use of singular or plural subject or
verb, and the like. All options must be grammatically consistent with the
stem. The question

The explorer who claimed the Mississippi Valley for France was: (a) Pizarro; (b) LaSalle;
(c) Cabot; (d) Hudson; (e) Smith.

is a one-choice, rather than a five-choice, question for a student who can
identify French names.

6. Avoid the use of textbook language or stereotyped phrases.

7. The position of the correct answer should be randomized throughout the
test.


13 Williams and Ebel studied the effects on test reliability of omitting the least
plausible alternatives from the items of an expertly constructed four-choice vocabu-
lary test. Reducing the number of alternatives to three increased students’ speed of
working, while reducing them to two increased their speed considerably. For tests of
equal working time, three-choice items gave a test of equal reliability, and two-
choice items gave a test of slightly higher reliability. B. J. Williams and Robert L.
Ebel, “The Effect of Varying the Number of Alternatives per Item on Multiple-
Choice Vocabulary Test Items,” 14th Yearbook, National Council on Measurements
Used in Education (New York: The Council, 1958), pp. 63-65.

14 The direct question or the incomplete statement that poses the problem is called
the “stem” of the test item.


Many teachers find that administering a group of recall or completion
items often produces a number of plausible incorrect responses, which




they can use in developing a multiple-choice test for later use. Incorrect 
responses obtained in this way are often more effective than ones that the 
teacher could invent, since they represent genuine sources of confusion for 
the students. In mathematics, the false alternatives in multiple-choice 
questions can be selected so as to represent typical errors or misunder- 
standings by students. 
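
The practice of harvesting distractors from students' own wrong answers, together with the advice to randomize the position of the correct answer, can be sketched as follows. The function names `pick_distractors` and `build_item`, the lower-casing of responses, and the `seed` parameter are illustrative assumptions rather than anything prescribed in the text.

```python
import random
from collections import Counter

def pick_distractors(responses, correct, how_many=3):
    """Return the most frequent wrong answers to a completion item.

    Responses are lower-cased so that "Edison" and "edison" are tallied together.
    """
    key = correct.strip().lower()
    wrong = [r.strip().lower() for r in responses
             if r.strip() and r.strip().lower() != key]
    return [answer for answer, _ in Counter(wrong).most_common(how_many)]

def build_item(stem, correct, distractors, seed=None):
    """Assemble a multiple-choice item with the options in random order."""
    options = [correct] + list(distractors)
    random.Random(seed).shuffle(options)
    return {"stem": stem, "options": options, "key": options.index(correct)}
```

The tally reflects only the class that took the completion items, so the teacher should still judge each harvested response for plausibility before using it.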


MATCHING EXERCISES A matching exercise is a special type of multiple-
choice question. The usual multiple-choice question has a single problem
or stem, followed by two to five options. In a matching exercise, there 
are several problems or questions; the answer to each one is to be chosen 
from a single list of options. That is, the same list of alternative responses 
is used for several test items included in one matching exercise. Obviously, 
a matching exercise has the advantage of compactness because it requires 
less space on the page than multiple-choice questions based on the same 
content. Such exercises are rather easily constructed and are easily and
objectively scored. In the following example, each of the options can be
used more than once.


Directions: Listed below are several types of lubricants, followed by automotive units
that require one of these lubricants. You are to match each unit with the correct type of
lubricant and place the identifying letter in the blank space provided. Use each letter
(A, B, and the like) as many times as is necessary. The first item is answered as an
example.


A. engine oil
B. fibrous grease
C. gear oil
D. lubricant impregnated
E. penetrating dripless lubricant
F. pressure gun lubricant

(C) (X) transmission
___ 1. distributor
___ 2. striker plate
___ 3. universal joint
___ 4. dovetail
___ 5. differential
___ 6. door hinges
___ 7. generator
___ 8. front wheel bearings
___ 9. drag-link ends
___ 10. spring pins
___ 11. steering gear
___ 12. carburetor air cleaner
___ 13. spindle pin
___ 14. drive shaft center bearing15


Matching questions, however, have their special limitations and hazards.
They are not well adapted to testing student knowledge in small units of
subject matter; it is difficult to find items that are sufficiently homogeneous
to require much discrimination on the part of students. Matching exercises
are more likely than any other type of objective test item to include
irrelevant clues to the correct response.


15 Adapted from Micheels and Karnes, op. cit., p. 233. 




The items in at least one of the two lists of a matching exercise should 
consist of single words or numbers or very brief phrases. Hence, matching 
exercises are well adapted to who, what, when, and where types of learn- 
ings and are usually considered ineffective in testing for understandings. 
An exception is the matching of items to indicate cause-effect relationships,
as in the following examples:


COLUMN 1 (Statements of Effects)

(C) (X) Fishing in Norway
( ) 1. Wandering life of the Eskimos
( ) 2. Transportation by camel
( ) 3. Houses of wood and bark with steep roofs of leaves
( ) 4. People wearing wooden shoes when working

COLUMN 2 (Statements of Causes)

A. High, snow-capped mountains
B. People depending on wild animals for food
C. Poor, rocky, forest-covered soil near the sea
D. Cool, damp climate; much low, wet ground
E. Hot, dry climate; much soft, loose sand
F. Hot, rainy climate; many raffia-palm trees
G. Broad, level plains; rich soil; moderate summer rain16


Another example of a matching exercise, involving cause-effect relation-
ships, is the following set of items, taken from a standardized test in
science. Actually, this represents a variation of the matching exercise, in
which a series of multiple-choice items (134-138 below) are classified
according to a key-list.


After each of statements 134-138, mark the letter designating the phrase below that
will make the statement true.


F if bacterial growth will be encouraged
G if bacterial growth will not be affected
H if bacterial growth will be decreased but will continue
I if bacteria will be killed but spores will remain alive
J if both bacteria and spores will be killed

134 Place bacteria in a refrigerator overnight
135 Expose bacteria to a direct flame for 30 seconds
136 Keep bacteria at a temperature of 98° Fahrenheit for 48 hours
137 Put a solution of boric acid on bacteria
138 Place bacteria on a medium of agar and beef broth17


The type of classification exercise given above is useful in many sub-
ject areas. For example, the key list might be a list of standard reference
works; and the items, a series of study questions, to be classified accord-
ing to the reference work most appropriate for answering each question.18


16 Adapted from N. Theresa Wiedefeld and E. Curt Walther, Wiedefeld-Walther
Geography Test (New York: Harcourt, Brace & World, Inc., 1931).

17 Reprinted with the permission of the California Test Bureau from Georgia
Sachs Adams, William E. Keeley, and John A. Sexson, California Tests in Social and
Related Sciences, Advanced, Form AA, Part III (Monterey, Calif.: California Test
Bureau, 1954), items 134-138.

In a social studies unit in which the characteristics of democracy had 
been compared with those of other forms of government, a series of items 
could list several characteristics of, or practices under, democratic and 
totalitarian governments. Preceding this series of items, the following 
directions could be used: 


After each item number on the answer sheet, blacken one lettered space to designate
that the item is characteristic of the theory of

A liberal democracy.

B Communism. 

C Fascism. 

D both Communism and Fascism. 

E both liberal democracy and Communism.19 


The following suggestions may help in the writing or revision of match- 
ing questions. 

1. Do not include too large a number of items in either column. The number
should probably vary from a minimum of 5 to a maximum of 12. The use
of longer lists requires the student to spend too much time in hunting for
the correct responses.

2. In any one question, do not mix items that are highly heterogeneous or dis-
similar. For example, do not include in a single matching exercise items that
require the matching of men and inventions with others that require match-
ing of battles and dates.

3. The column of responses or options should include more alternatives than
the column of questions or test items, in order to prevent the student from
selecting the last response on the basis of elimination.

4. It is frequently advisable to allow certain items in the response column to
be used more than once so as to reduce the effect of guessing. If this plan
is used, the preceding suggestion becomes unnecessary.

5. If possible, the response column should contain shorter statements than the
question column so that the student can scan the possible responses quickly.

6. The items in the response column should be arranged systematically if pos-
sible (names in alphabetical order, dates in chronological order, and the
like). Note that in the example on page 343, responses F through J are in
order of efficacy of bacterial control.

7. Double check to make sure that there is only one item in the response
column that is the correct answer for each test item (unless the directions
indicate that responses may be used more than once).

8. Avoid requiring the student to match parts of incomplete sentences because
of the probability of introducing grammatical clues to the correct responses.

9. Be sure that a matching exercise appears on a single page of the test.
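
Several of the suggestions above are mechanical enough to check automatically before a test is duplicated. The following is a rough sketch, not a complete validator: the function name `check_matching_exercise` is hypothetical, and alphabetical order stands in for whatever systematic arrangement (chronological, by magnitude, and so on) suits the content.

```python
def check_matching_exercise(items, responses, reuse_allowed=False):
    """Return a list of warnings for a draft matching exercise."""
    problems = []
    # Suggestion 1: roughly 5 to 12 entries in either column.
    for name, column in (("items", items), ("responses", responses)):
        if not 5 <= len(column) <= 12:
            problems.append(f"{name} column has {len(column)} entries; 5 to 12 is suggested")
    # Suggestions 3 and 4: extra responses, unless responses may be reused.
    if not reuse_allowed and len(responses) <= len(items):
        problems.append("responses should outnumber items unless reuse is allowed")
    # Suggestion 6: a systematic arrangement (alphabetical order assumed here).
    if list(responses) != sorted(responses, key=str.lower):
        problems.append("responses are not in alphabetical order")
    return problems
```

Suggestions about homogeneity, grammatical clues, and page layout still call for the teacher's own reading of the exercise.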

18 For an illustrative question of this type, see Georgia Sachs Adams and T. L. 
Torgerson, Measurement and Evaluation for the Secondary School Teacher (New 
York: Holt, Rinehart and Winston, Inc., 1956), р. 394. 

19 Max D. Engelhart, “Exercise Writing in the Social Sciences,” Proceedings of
the 1957 Invitational Conference on Testing Problems (Princeton, N. J.: Educational
Testing Service, 1958), p. 61.




A number of modifications of matching questions can be used. For 
example, students may be given a map or chart on which certain locations 
are assigned numbers or letters. These numbers or letters can then be 
matched with a list of cities, rivers, and the like. 


Situation- or Problem-solving Items 


Situation- or problem-solving items deserve special mention because of 
their value in measuring student understandings, as contrasted with mem- 
orization or rote learning. A large part of the skill required in constructing 
tests of this type is involved in devising situations or problems that require 
the student to make interpretations from new data or to apply principles 
learned to new situations. The following examples are taken from the
California Tests in Social and Related Sciences.
James wanted to be elected president of the school council. During the month before
election he did the following things. Which one do you think was undemocratic?

a. presented his plans concerning what he would do if elected president
b. appealed to the city boys to “stick together” and not allow a country boy to win
the office
c. gave a talk in assembly asking the children to vote for him
d. told the school that he would try to have more school picnics if he were elected20


Examine the diagram of a two-way telephone set below. Which two of the following
statements are true?

[Diagram of a two-way telephone set, not reproduced]

In order to function properly, circuits AE and BG must be connected more directly.
Diaphragm J vibrates when someone is talking into transmitter B.
In receiver C, the purpose of electromagnet NS is to vibrate diaphragm K.
Current cannot flow in AE unless someone is talking into it.21


20 Reprinted with the permission of The California Test Bureau from Georgia
Sachs Adams and John A. Sexson, California Tests in Social and Related Sciences,
Elementary, Part I (Monterey, Calif.: California Test Bureau, 1953), Form AA,
item 83.

21 Reprinted with the permission of The California Test Bureau from Georgia
Sachs Adams and others, California Tests in Social and Related Sciences, Advanced,
Part III (Monterey, Calif.: California Test Bureau, 1954), Form AA, item 41.




Problem-solving tests may be of the essay or completion, as well as the 
multiple-choice, type. For example, a map, graph, or table can be pre- 
sented and students can be asked to draw generalizations from it. Such 
generalizations can later be used by the teacher in building a multiple- 
choice test on the same material. 

Although situation-type exercises are difficult to construct, they are 
valuable in that they stimulate the functional use of facts and generaliza-
tions. A number of illustrations of situation- or problem-solving items
are given in the next chapter, in which we present illustrative exercises 
for each category of the taxonomy. 

Since situation- or problem-solving items may be of many types, it is 
impossible to make a list of specific suggestions for their construction. 
It is important that the situation contain an element of novelty without 
being so novel as to be unrealistic or confusing. The situation should be 
a challenging one to the students, and the alternative choices should all 
be plausible. The student should not be able to make the correct responses
without careful reading of the question and critical thinking. Ordinarily, a
good situation-type problem is one that would provide a good basis for
class discussion. Before he attempts to write the alternative responses
necessary for an objective test of this type, the teacher may wish to use a
situation-type test as a basis for class discussion or for written responses
by students.

Situation-type items are frequently used to measure the student's ability
to apply principles learned in new situations. Hence, Dunning's step-by-step
outline for developing tests on application of principles is relevant:

Step 1. Decide on the principle or principles to be tested. Criteria to be
considered:

a. Should be known principles but the situation in which the principles are
to be applied should be new.

b. Should involve significantly important principles.

c. Should be pertinent to a problem or situation common to all students.

d. Should be within the range of comprehension of all students.

e. Should use only valid and reliable sources from which to draw data.

f. Should be interesting to the students.

Step 2. Determine the phrasing of the problem situations so as to require
the student in drawing his conclusion to do one of the following:

a. Make a prediction.

b. Choose a course of action.

c. Offer an explanation for an observed phenomenon.

d. Criticize a prediction or explanation made by others.

Step 3. Set up the problem situation in which the principle or principles
selected operate. Present the problem to a class with directions to draw a
conclusion or conclusions and give several supporting reasons for their answer.


Teacher-Made Tests 347 


Step 4. Edit the students' responses, selecting those that are most repre- 
sentative of their thinking. These will include conclusions and supporting 
reasons that are both acceptable and unacceptable. 

Step 5. To the conclusions and reasons obtained from the students, the 
teacher now adds any others that he feels are necessary to cover the salient 
points. The total number of items should be at least 50 percent more than is 
desired in the final form to allow for elimination of poor items. The following 
list is a guide to the type of statements that can be used: 

a. True statements of principles and facts.

b. False statements of principles and facts.

c. Acceptable and unacceptable analogies.

d. Appeal to acceptable or unacceptable authority.

e. Ridicule.

f. Assumes the conclusion.

g. Teleological explanations.

Step 6. Submit test to other judges for criticisms. Revise test in view of
criticisms.

Step 7. Administer test. Follow with thorough class discussion.

Step 8. Conduct an item analysis.

Step 9. In the light of steps 7 and 8, revise the test.²²


EVALUATING OBJECTIVE TEACHER-MADE TESTS 


One of the best ways for a teacher to improve his skill in test construc- 
tion is to evaluate his own tests—that is, to apply the generalizations 
developed in this chapter to specific test items. As a guide for such evalua- 
tion, the following checklist has been prepared. In applying this list 
to a teacher-made test, the teacher should check any relevant criticisms 
and note the specific test items or other characteristics that justify 


each check mark. 


CHECKLIST FOR DISCOVERING FAULTY TEST ITEMS AND OTHER ERRORS
IN THE CONSTRUCTION OF OBJECTIVE TESTS


Selection of Content

— 1. Failure to set up a table of specifications, or test blueprint, indicating
the proportionate emphasis to be given to various objectives and content
areas.

— 2. Failure to follow a test blueprint in the approximate distribution of
test items.

— 3. Failure to emphasize the important facts and generalizations; test items
emphasize mere details.


22 Gordon M. Dunning, "Evaluation of Critical Thinking," Science Education, vol.
38 (April 1954), pp. 191-193.




— 4. Introduction of material that is appropriate only for essay and discussion
questions.


Other Factors Affecting Validity 


— 5. Poor wording of items such as use of (a) bookish terms, (b) long involved
phrases or statements, (c) vocabulary too advanced or unnecessarily
technical.

— 6. Use of "specific determiners"—that is, clues afforded by phrasing, which 
tend to determine the student's response in the absence of knowledge. 
Specific examples are given in checklist item 23. 

— 7. Ambiguous statements. 


— 8. Failure to give all pertinent information necessary for student to choose
answer.

Technical Make-up 

— 9. Lack of directions. 


—10. Directions not specific or clear; failure to include sample exercise for 
unfamiliar type of item. 


——11. Use of items that help student to answer other items. 


Other Factors Affecting Reliability 

—12. Too few questions to ensure an adequate sampling of the material
tested.

—13. Difficulty range not adequate for purpose.

—14. Scoring not entirely objective; for example, scoring key does not include
all possible answers to recall questions.


Physical Make-up 


—15. Mimeographing (or other means of duplication) poorly done. 

——16. Position of true-false questions, or position of answer-responses to 
multiple-choice questions, not randomized. 

—17. Faulty arrangement of the test items (questions crowded together; ques- 
tions not grouped according to type; answer blanks scattered so that 


scoring is tedious; multiple-choice or matching question divided, that is, 
typed or printed on two different pages of test). 


ERRORS PECULIAR TO A SPECIFIC TYPE OF TEST ITEM 
True-False 


—18. Statements that are partly true and partly false.

—19. Number of true statements excessively large. (Students who do not
know answers tend to guess "true" more frequently than "false.")

—20. Statements "lifted" from the text.

—21. Use of true statements that are overlong. (A study conducted by a
committee in one of the test-construction classes at the University of
Wisconsin found that statements of 25 words or more were true in 80
percent of the cases.)

—22. Use of double negatives.

—23. Use of "specific determiners." Items containing the following are more
often false than true: totally, entirely, exactly, completely, solely, fully,
exclusively, very, perfectly, absolutely, all, always, no, none, not, nothing,
only, alone, never. Items containing the following are more often true
than false: was one of, may, usually, generally, most commonly, as a
rule, as a whole, even if, almost all, mainly, almost entirely, some,
sometimes, often, frequently, several, many, probably, approximately, largely,
ever.

— 24. Crucial element of statement placed in a phrase or a subordinate clause. 


Multiple-choice 

— 25. More than one correct response (unless question is intended to have 
two or more correct responses). 

— 26. Distractors (false alternatives) that are not plausible. 

——27. Responses too long, involving needless repetition (which could be re- 
duced by rewording stem of item). 

—28. Correct responses longer or more cautiously worded than others. 

—29. Responses that are grammatically inconsistent with stem.

— 30. Clues that help the unprepared student to select correct answer or elimi- 
nate one or more distractors. 


Matching 

— 31. Inconsistent or heterogeneous material within a column.

— 32. Too few (less than 5) or too many (more than 15) items per question.

— 33. Column of alternative responses not alphabetized or arranged in some
other suitable systematic manner.

— 34. Same number of items in both columns.

—35. "Specific determiners": identical elements, illogical statements,
grammatical clues.


Completion and Simple Recall 

— 36. Lack objectivity; that is, too many possible answers.

— 37. Blanks not uniformly arranged for easy scoring.

— 38. Too many blanks per item.

— 39. Item consisting of a sentence from a text with one or two words omitted.

— 40. Clues that help the unprepared student (such as number of blanks or
blanks of varying length).

— 41. Items requiring recall of trivial information.
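A few of the checklist criticisms for true-false items can be screened mechanically before a colleague reviews the test. The following is only a minimal sketch: the function name `screen_true_false` and the returned dictionary layout are assumptions made for illustration, the word lists are the single-word cues from checklist item 23 (multi-word cues such as "as a rule" or "almost all" are omitted), and the 25-word threshold comes from checklist item 21.

```python
import re

# Single-word cues from checklist item 23. Multi-word cues ("as a rule",
# "almost all", and so on) are left out of this sketch for simplicity.
MORE_OFTEN_FALSE = {
    "totally", "entirely", "exactly", "completely", "solely", "fully",
    "exclusively", "very", "perfectly", "absolutely", "all", "always",
    "no", "none", "not", "nothing", "only", "alone", "never",
}
MORE_OFTEN_TRUE = {
    "may", "usually", "generally", "mainly", "some", "sometimes", "often",
    "frequently", "several", "many", "probably", "approximately",
    "largely", "ever",
}


def screen_true_false(statement):
    """Flag checklist problems that can be detected from wording alone."""
    words = re.findall(r"[a-z]+", statement.lower())
    return {
        "false_cues": sorted(set(words) & MORE_OFTEN_FALSE),
        "true_cues": sorted(set(words) & MORE_OFTEN_TRUE),
        "overlong": len(words) >= 25,  # checklist item 21
    }


report = screen_true_false("Water always boils at exactly 100 degrees.")
print(report["false_cues"])
```

A flagged word is a reason to reread the item, not proof that the item is faulty; judgments about ambiguity or plausibility still require a human reader.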


PREPARING A TEACHER-MADE TEST FOR USE 


Up to this point, we have been concerned with the planning of a test blueprint
and the writing of different types of items. These are crucially
important topics. The quality of a test is determined largely by the quality
of its items. Careful attention to the editing of items, the writing of
directions for students, the arrangement of items within the test, and the design
of answer sheets and scoring keys can further improve test quality.


Editing Items 


Test items should be written well in advance of the time that they are
needed. The teacher who hastily reviews items he has just composed may
fail to catch ambiguities, for he reads into the item what he intends to
communicate. Ideally, another teacher in the subject area should review 
the items, criticizing them and also indicating what he considers to be the 
correct responses. In requesting such assistance from a colleague, the 
teacher can emphasize that he would much rather revise or eliminate 
items than face the ill feelings that inevitably develop among students 
when test questions are ambiguous. In addition to revising ambiguous 
questions at this time, the teacher should try to substitute improved 
options for any which are not plausible, revise the stem and/or options 
for any items incorrectly keyed by a colleague, and eliminate (or revise) 
any items that appear to be extremely easy or difficult for the group. 

The teacher should have written at least 20-30 percent more items than 
are needed so that (1) he can afford to discard items that cannot be 
satisfactorily revised and (2) he has some "elbow room" in adjusting the
number of items to his table of specifications. 


Grouping Test Items 


After the items have been reviewed and the necessary changes made, 
they should be arranged for typing. If items are on cards, their arrange- 
ment in any desired sequence can easily be accomplished. There are three 
possible bases for grouping: (1) type of item (that is, true-false, recall, 
and the like), (2) difficulty of item, and (3) content. The chief reasons 
for the first type of grouping are (1) so that instructions for answering 
can be carried throughout a set of items, and (2) so that standard answer 
sheets can be used. This first basis for grouping is almost uniformly used. 

Within each type of item (true-false, recall, and the like), the teacher 
can choose to group either by difficulty or content. In a very long test, he 
might do both. Arranging items in approximate order of difficulty is 
especially important if there is insufficient time for all students to com- 
plete the test; thus the slow-working student does not waste time on ques- 
tions that are “beyond him," nor does he get discouraged early in the test. 

If a class is fairly homogeneous with respect to student ability, and 
adequate testing time is allowed, arranging items by approximate difficulty 
may be less important than arranging items by content. Arranging items 
according to content makes the test have higher face validity for the 


examinee and may aid the teacher in judging the areas in which review 
or class discussion is desirable. 
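The grouping rules above can be sketched as a two-level sort: by type of item first, then by difficulty or by content within each type. This is an illustrative sketch only; the `Item` record, its field names, and the use of an estimated proportion of success as the difficulty measure are assumptions, not part of the procedure described in the text.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    item_type: str      # for example "true-false" or "multiple-choice"
    content_area: str
    difficulty: float   # estimated proportion succeeding (higher = easier)


def arrange_items(items, within="difficulty"):
    """Group by item type, then order within each type."""
    if within == "difficulty":
        # Easiest items first, so slow workers meet hard items last.
        key = lambda it: (it.item_type, -it.difficulty)
    elif within == "content":
        key = lambda it: (it.item_type, it.content_area)
    else:
        raise ValueError("within must be 'difficulty' or 'content'")
    return sorted(items, key=key)


items = [
    Item("q1", "true-false", "A", 0.4),
    Item("q2", "multiple-choice", "B", 0.9),
    Item("q3", "true-false", "B", 0.8),
    Item("q4", "multiple-choice", "A", 0.5),
]
print([it.text for it in arrange_items(items)])
```

Sorting by type alphabetically is an arbitrary choice here; a teacher would normally fix the order of types by hand.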


Writing Directions for Examinees 


For high school or college students who are test-wise, directions for 
the usual true-false or multiple-choice items may seem superfluous. How- 
ever, it is far better to include instructions than to have some students 




make mistakes in their method of indicating replies, or to have class time 
wasted by questions on test procedure. 

The following sample directions may be used or adapted if a home- 
made answer sheet (similar to Figure 10.2) is used. If students indicate 
their answers on the test itself, obvious changes can be made to make 


the directions simpler and briefer. 


True-False Items

Read each of the following statements. Indicate your answers on the separate answer
sheet. Make no marks on the test. If a statement is true, cross out the A on your answer
sheet. If a statement (or any part of a statement) is false, cross out the B.


Multiple-Choice Items

Read each of the following items. Indicate your answers on the answer sheet. Make
no marks on the test. Decide which of the options or choices best completes the
statement or answers the question. If you think that choice A is the best answer to Item 1,
cross out the A in the row after No. 1 on your answer sheet; if you think choice B is
best, cross out the B, and the like.


Name (last name) (first name)    Period
Course    Date
Title of examination

Directions: Fill in all the information requested above. Make no marks on your
test. Read the directions on the test carefully, and follow directions exactly. For
each multiple-choice item, mark your choice for the correct answer by crossing
out the letter that corresponds to your choice. For True-False items, indicate your
choice of True by crossing out the A, or your choice of False by crossing out the B.


Fig. 10.2 An Illustrative Mimeographed Answer Sheet (to be used with
a lay-over scoring stencil).




Sample directions for modified true-false items have been given on 
page 338 and for a matching item on page 342. In Chapter 11 a number 
of items with which students might not be familiar are included. The 
directions given for these items will be helpful to teachers in devising 
instructions for similar types of items. 

If a test is being given with a liberal time allowance, as is true for most 
teacher-made tests, students can be asked to answer every question. Under 
such circumstances, simple number-right scores rank students in the same 
order as scores corrected for guessing. Hence, unless the teacher wishes 
to discourage guessing, no correction formula need be used. 

If the teacher wants to use a "correction-for-guessing" formula, tests
should be scored for both rights and wrongs. The usual formula for a
corrected score is

$$\text{Corrected score} = \text{Number right} - \frac{\text{Number wrong}}{n - 1}$$

where n is the number of options. For true-false questions, substitution
of 2 for n gives a corrected score of R − W. For five-choice questions,
one-fourth of the number wrong is subtracted from the number right. Since
this formula tends to penalize the overcautious student,²³ the test
instructions should encourage students to utilize their partial knowledge. The
following advice to students taking College Entrance Board Examinations
is much better than a warning not to guess.


Many candidates are unsure of the wisdom of guessing at the answers to 
questions about which they are uncertain. For each of the College Board 
Achievement Tests, . . . a percentage of the wrong answers is subtracted from
the number of right answers as a correction for haphazard guessing. It is 
highly unlikely, therefore, that blind guessing will improve your score signifi- 
cantly; it may very well lower it, and it does take time. Often, however, you 


23 The formula for "correction for guessing" is based on the assumption that the
student's chance of picking the correct response on items he doesn't know is $1/n$. That
is, if there are n choices, the chance probability of a correct guess is $1/n$ and of an
unlucky or wrong guess, $(n-1)/n$. For every n − 1 wrong guesses, we expect one correct
guess. Hence, to correct for guessing, one subtracts $\frac{\text{number wrong}}{n-1}$ from the score
(number right). This formula is based on the faulty assumption that items can be
categorized neatly into those the student knows and those he does not know. When 
a correction formula is used, the student who is willing to gamble has an advantage 
over the student who omits all items about which he is doubtful. The student who 
is willing to gamble capitalizes on his partial knowledge as well as weaknesses in 
test items; that is, if a student can easily eliminate two of the five options as illogi- 
cal, his chance of getting the item right is higher than the formula assumes. 




will not be sure of a correct answer but will have some knowledge of the 
question. If you can eliminate one or more of the answer choices as definitely 
wrong, it will be to your advantage to answer the question even though you 
must make a guess as to which of the remaining answers is correct.²⁴


Instructions should be given regarding the policy students should follow 
when they are uncertain about answers. These instructions should be 
included as part of the regular set of directions, rather than given orally, 
or in response to individual students who come up to inquire about it. 
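The correction-for-guessing formula given earlier in this section, and the point that a student with partial knowledge gains by answering, can be checked with a short computation. The function names `corrected_score` and `expected_gain` are hypothetical, chosen only for this sketch; `expected_gain` assumes the student guesses at random among the options he has not eliminated, which is an illustrative assumption rather than a claim about how students actually behave.

```python
from fractions import Fraction


def corrected_score(right, wrong, n_options):
    """Number right minus (number wrong) / (n - 1)."""
    if n_options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (n_options - 1)


def expected_gain(n_options, eliminated=0):
    """Expected change in corrected score from guessing on one item,
    after the student has ruled out `eliminated` wrong options."""
    remaining = n_options - eliminated
    p_right = Fraction(1, remaining)
    penalty = Fraction(1, n_options - 1)
    return p_right - (1 - p_right) * penalty


print(corrected_score(30, 10, 2))   # true-false: R - W
print(corrected_score(30, 8, 5))    # five-choice: one-fourth of wrongs removed
print(expected_gain(5))             # blind guessing
print(expected_gain(5, eliminated=2))
```

Under these assumptions blind guessing on a five-choice item neither helps nor hurts on the average, while eliminating two options makes the expected gain positive, which agrees with the advice quoted above.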


Physical Make-up of the Test 


A common mistake in teacher-made tests is to try to crowd too much
material on a page. As a rule, each option for a multiple-choice item
should be typed on a separate line. If the responses are very brief, two
columns can be used, as on page 371. The items that refer to a map, table,
or chart should be on the same page as that material. A matching exercise
or multiple-choice item should not be divided between two pages.

If a separate answer sheet is not being used, the test itself should be
designed to facilitate scoring. Most students in the fourth grade and above
can copy their answers in an answer column so that the teacher can lay
his scoring key beside this column. If the teacher wishes to use a cut-out,
lay-over scoring stencil, students can encircle the words "true" or "false"
(or the letters associated with options) that are typed in an answer
column. Question numbers should be repeated in the answer column.

If students are being asked to encircle the letter of an option, it is best
not to repeat the same letters for all questions but to use the following
plan, which helps younger students to keep their place.


1. A B C D E
2. F G H I J
3. A B C D E


Needless to say, the options have to be lettered correspondingly in the test
items.


If separate answer sheets are used for teacher-made tests given in
elementary schools, the test should be typed first, and then the answer sheet
should be typed so that the answer spaces are right in line at the end
of each test question, rather than uniformly spaced as in Figure 10.2.
With the use of this procedure, plus the alternation of letters, as
suggested above, separate answer sheets can be used with many fourth-, fifth-,
and sixth-grade classes. Answer sheets have advantages in ease of scoring,
economy of reusing tests, and use in item analysis, discussed in the next
section of the chapter.

24 A Description of the College Board Achievement Tests (Princeton, N. J.:
Educational Testing Service, 1962), p. 17.


STATISTICAL ANALYSIS OF TEST RESULTS 


The authors of standardized tests always write many more test items than 
are needed and decide, on the basis of a preliminary tryout, which items 
should be revised or eliminated. They study the difficulty of each item so 
that they can select items which meet desired standards with respect to dif- 
ficulty. They also study the relationship of each item with the criterion. 
Reference to the summary tables of Chapter 4 will reveal that if a predictor 
test is being devised, the author's goal is to select items that show a fairly 
high relationship with an external criterion. If an achievement test is being 
developed, the criterion is usually total test score. That is, on an achieve- 
ment test, efficiency of measurement can be improved by discarding or 
revising items that do not correlate well with total score, or do not
discriminate between high-achieving and low-achieving students (on the test
as a whole). Such a study of the difficulty and discrimination value of
items is called an item analysis.

Before we consider item analysis procedures that are more suitable for 
teachers, we will discuss a procedure frequently used by authors of stand- 
ardized tests, that is, finding for each item its bi-serial r with total test 
score. Bi-serial r is a special type of correlation coefficient, computed 
when one of the variables is dichotomous (that is, has only two possible 
values). The score for an item is either 1 or 0, but the total test score can 
have many values. Hence this type of coefficient is appropriate. 
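The text does not give a computing formula for bi-serial r. As a rough sketch, one standard textbook form is $r_{bis} = \frac{M_p - M_q}{s_t} \cdot \frac{pq}{y}$, where $M_p$ and $M_q$ are the mean total scores of students passing and failing the item, $s_t$ is the standard deviation of total scores, $p$ and $q$ are the proportions passing and failing, and $y$ is the ordinate of the unit normal curve at the point dividing it into areas $p$ and $q$. The function name `biserial_r` and the choice of the population standard deviation are assumptions of this sketch.

```python
from statistics import NormalDist, fmean, pstdev


def biserial_r(item_scores, total_scores):
    """Bi-serial correlation of a 0/1 item score with total test score."""
    passed = [t for i, t in zip(item_scores, total_scores) if i == 1]
    failed = [t for i, t in zip(item_scores, total_scores) if i == 0]
    if not passed or not failed:
        raise ValueError("item must be passed by some students and failed by others")
    sd = pstdev(total_scores)
    if sd == 0:
        raise ValueError("total scores do not vary")
    p = len(passed) / len(total_scores)
    q = 1 - p
    unit_normal = NormalDist()
    # Ordinate of the normal curve at the point cutting off proportion p.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (fmean(passed) - fmean(failed)) / sd * (p * q / y)


item = [1, 1, 1, 0, 0, 0]
totals = [9, 8, 7, 4, 3, 2]
print(round(biserial_r(item, totals), 2))
```

With very small or markedly non-normal groups, as in this toy example, the coefficient can exceed 1.00; it is meaningful only for groups of reasonable size.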

Since bi-serial r is used so frequently in test construction, short-cut
procedures have been devised. One can merely compute the proportion
of students succeeding on an item in the high-scoring group and the
corresponding value for the low-scoring group; then one looks these values
up in a table and finds the value of bi-serial r.²⁶ The test author then
selects items with fairly high bi-serial r's that also provide a satisfactory
distribution of item difficulties and (in achievement tests) a satisfactory
distribution with respect to content areas and objectives. An average
bi-serial r of .40 is considered adequate, while .50 is exceptionally high.

25 For machine-scored tests, the number of students succeeding on each item is
easily obtained by the use of an item-count attachment to the test-scoring machine.
Item data for one hundred test papers can be counted by machine in ten minutes.
An illustration of a graphic item count record is given in Figure 14.4.

26 Chung-Teh Fan, Item Analysis Table (Princeton, N.J.: Educational Testing
Service, 1952). A procedure for approximating the bi-serial r when item analysis
data for the high-achieving and low-achieving halves of the class are compared is
given in Figure 10.3 of this textbook.

In selecting the high-scoring and low-scoring groups for comparison, 
the customary procedure is to select the highest and lowest 27 percent of 
the group, because this percentage has been found to work most effec- 
tively when results are analyzed for a fairly large group of students. How- 
ever, teachers usually work with groups of 50 students or fewer; hence it is 
best to compare results for the high-achieving and low-achieving halves of 
the group, because the results have greater reliability (that is, tend to 
vary less from one sampling of students to another). 

After considerable experimentation, Diederich²⁷ has developed procedures
for having students help in item analysis by a show of hands. He
reports that these procedures can be used over a wide grade span. Even
fourth-grade classes do not find it too difficult, while college students do
not resent the use of class time. In fact this use of class time can be
justified in that the remainder of the period can be used in discussion of the
questions most frequently missed. The procedures may be summarized
as follows:


1. After the papers have been scored, the teacher records the distribution of
scores on the chalkboard. If tests have just been scored in class, the teacher
can call off successive scores, beginning with the highest, while a student
counts the number of hands raised for each score.

2. The teacher then quickly finds the median or midscore. All papers are
collected and quickly sorted as above, at, or below the midscore. Papers in
the high-scoring half of the class are passed to one side of the room (say
the left side), the low-scoring papers to the other side. One-half of the papers
in the median interval are assigned to each side. Thus, each student has a
paper, and presumably no student has his own. (If the total number of
students in the class is an odd number, one of the papers in the median
interval is excluded.)

3. For each item, the teacher records on the board (and students record
following each item on their papers) four figures, which are labeled and
defined as follows:

H        The number of highs who got the item right
L        The number of lows who got the item right
H + L    Success: the total number who got the item right
H − L    Discrimination: the high-low difference (how many more
         highs than lows got the item right)

For example, the teacher calls "item 1." Everyone whose paper has the item
right raises his hand. A student assigned to count the highs calls out the
number of upraised hands in the high section; then the counter for the lows
calls out the number of hands in his group. The teacher (or student
scorekeeper) calls out the total and then the difference. For example, if there
were fifteen highs and ten lows for item 1, the sequence would be: "item
1, hands [pause for counting] 15-10-25-5; item 2 . . . ."


27 Paul B. Diederich, "Item Analysis," in Short-Cut Statistics for Teacher-Made
Tests, Evaluation and Advisory Service Series No. 5 (Princeton, N.J.: Educational
Testing Service, 1960), pp. 1-10.


The items with the highest success values are the easiest for this group.
If the teacher wishes, he can easily change these to percentages for his
own use by multiplying by the constant $100/n$. For example, if there were
40 in the class, each success value would be multiplied by $100/40$ or 2.5. The
success value for this illustrative item would be 2.5 × 25 or 63 percent.
If the test is designed to rank pupils (as a basis for grading), inclusion
of too many easy items is not good use of testing and scoring time. On
the other hand, very difficult items should be examined for ambiguity,
especially if failed by a large percentage of the high group.
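The four figures called out for each item, together with the conversion of the success value to a percentage, amount to the following arithmetic. This is a minimal sketch; the function name `item_figures` and the dictionary keys are assumptions introduced for illustration.

```python
def item_figures(high_right, low_right, class_size):
    """Return H, L, H + L, H - L and the success value as a percentage."""
    success = high_right + low_right
    return {
        "H": high_right,
        "L": low_right,
        "success": success,
        "discrimination": high_right - low_right,
        # Multiply by the constant 100 / n, where n is the class size.
        "percent_success": success * 100 / class_size,
    }


# The illustrative item: fifteen highs and ten lows in a class of 40.
figures = item_figures(15, 10, 40)
print(figures["success"], figures["discrimination"], figures["percent_success"])
```

The unrounded percentage for the illustrative item is 62.5, which the text reports as 63 percent.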

The minimum high-low difference corresponding to the standards of 
professional test construction would be 10 percent of the group, or four 
pupils in a class of 40. However, because of the variation from class to 
class in student performance on items, we would want to examine, rather 
than discard, items with high-low differences smaller than this. An item 
can frequently be improved by substituting options that will attract more 
choices from poorly prepared students. The teacher must not routinely
discard items that fail to achieve a specified discrimination value without
considering the effect on the test's balance of items by content areas
(according to his own table of specifications).

If time permits, the show-of-hands procedure can also be used for the
second stage of item analysis, that is, the number of students choosing
each option. If there is not time, this stage (which involves only those
items that showed low or negative discrimination values)²⁸ can be
completed by the teacher at home.


Which of the following contains the most heat?

                                                     Highs   Lows
a. A red-hot branding iron                             0      10
b. A lake full of water                               20       8
c. A car engine that has been driven in the heat       0       0
d. A pail of boiling water²⁹                           0       2


28 Negative discrimination values are obtained when more low-scoring than
high-scoring students get an item correct.

29 Adapted from Gilbert Sax, The Construction and Analysis of Educational and
Psychological Tests (Madison, Wis.: College Printing and Typing Company, Inc.,
1962), p. 47.




Obviously option a is a good distractor; however, option c has drawn no
choices. Since the terms "red-hot" in option a and "boiling" in option d
may have drawn choices to those options, one might substitute for option c
such a distractor as "A barrel of hot tar." This type of item analysis may
also help the teacher in group diagnosis, that is, in his understanding of 
misconceptions held by members of the class. Sometimes distractors can 
be written to represent different types of common errors and misunder- 
standings; for example, the distractors in arithmetic problems can represent 
types of procedural errors frequently made by students. 
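The second stage of item analysis, tallying how many highs and lows chose each option, lends itself to a simple review routine. The sketch below uses the counts from the heat item above; the function name `review_options`, the wording of the notes, and the rules for labeling a distractor are assumptions made for illustration, not a procedure given in the text.

```python
def review_options(counts, keyed):
    """counts maps each option letter to (highs choosing, lows choosing)."""
    notes = {}
    for option, (highs, lows) in counts.items():
        if option == keyed:
            notes[option] = "keyed answer"
        elif highs + lows == 0:
            notes[option] = "drew no choices; consider replacing"
        elif lows > highs:
            notes[option] = "working distractor"
        else:
            notes[option] = "attracts highs as often as lows; check wording"
    return notes


heat_item = {"a": (0, 10), "b": (20, 8), "c": (0, 0), "d": (0, 2)}
for option, note in review_options(heat_item, keyed="b").items():
    print(option, note)
```

For this item the routine singles out option c, the same option the discussion above proposes to replace.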

If the use of class time in item analysis cannot be justified, or if the
test is a final examination, which is scored after the last class session, item
analysis by machine can be used; or the teacher can use a short-cut
procedure devised by Katz.³⁰


TEACHER COOPERATION IN TEST DEVELOPMENT 


Progress toward improving teacher-made tests is greatly facilitated if the
teachers of multiple-section courses in a school or school district cooperate
in the development of tests, or at least a file of test items, in each subject
field. There are many advantages in teacher cooperation in the development
of tests. Far better tests are likely to result from cooperative
discussion of the objectives of a course and the types of items that might
be effectively used in measuring student growth toward such objectives.
Then, if the work can be divided on the basis of subject areas or
objectives, each teacher can concentrate on writing and revising a smaller
number of test items than he would need for an entire end-of-course
examination. When the test items have been written and copies of all
items distributed to the teachers, each teacher should "take the test."
Differences of opinion regarding the keying of items will help to discover
ambiguous questions before they are actually used.

Another significant advantage of teacher cooperation is that the results
of item analysis can be pooled. The item difficulty and discrimination
values obtained through pooling results from several classes are much
more reliable than could be obtained by one teacher. If teachers do not
wish to have a common departmental test, they can select their own items
from a master file and record their item analysis data for each item, as
shown in Figure 10.3.


30 Martin Katz, "Improving Classroom Tests by Means of Item Analysis," The
Clearing House, vol. 35 (January 1961), pp. 26




U S HISTORY                                        1.12
TERRITORIAL
EXPANSION


Item: Texas became part of the United States by

a. purchase from France
b. purchase from Spain
c. treaty with Spain after the Spanish-American War
d. request of the people of Texas


Difficulty (% success): 10%, 60%, 75%

Estimated bi-serial r*: .39, .42, .36

Number choosing each option: a b c d
High-scoring half: 2 18
Low-scoring half: 3 3 1 13


Fig. 10.3 An Item Card for a Departmental Item File


* If the performance of students in the high-scoring and low-scoring halves of the
class (above and below the median on total test score) is compared, the bi-serial r
for items of moderate difficulty is approximately equal to three times the high-low
difference (expressed as proportion of group). For example, if the high-low
difference is 10 percent or .10, the bi-serial r, or discrimination index, would be
3 × .10 or .30. If the high-low difference is 15 percent or .15, the index would be
3 × .15 or .45. This estimate is approximately correct for items which 20-80 percent
of the students answer correctly. For very easy or very difficult items, this
approximation underestimates the index. For an item on which more than 80 percent
succeed, a high-low difference of 5 percent is acceptable; whereas for other items, the
difference should be at least 10 percent, with the corresponding index being at least
.30. Because of the sampling error involved in studying small groups, items with
low discrimination indexes on one trial may have an index above .30 with another
class. Ordinarily items with low indexes should be revised, for example, by
substituting a better distractor for one which is not attracting the choices of low-scoring
students. The methods of making an item analysis of a test by a "show of hands"
in class, together with short-cut procedures for analyzing the data, are given in
Paul B. Diederich, Short-Cut Statistics for Teacher-Made Tests, Evaluation and
Advisory Service Series No. 5 (Princeton, N. J.: Educational Testing Service, 1960).
Single copies are available free on request to the Educational Testing Service, 20
Nassau Street, Princeton, New Jersey.


———————————____ Z yțșuo 
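The rule of thumb in the note above can be worked through in a few lines. The following is only a sketch under stated assumptions: the function names are ours, the two halves of the class are assumed to be of equal size, and the result is the rough estimate described in the note, not a computed bi-serial r.

```python
def high_low_difference(high_correct, low_correct, half_size):
    """High-low difference, expressed as a proportion of one half of the class."""
    if half_size <= 0:
        raise ValueError("half_size must be positive")
    return (high_correct - low_correct) / half_size


def difficulty(high_correct, low_correct, half_size):
    """Proportion of the whole class answering the item correctly."""
    return (high_correct + low_correct) / (2 * half_size)


def estimated_index(high_correct, low_correct, half_size):
    """Rule of thumb from the note: index is about 3 x the high-low difference."""
    return 3 * high_low_difference(high_correct, low_correct, half_size)


def rule_of_thumb_applies(high_correct, low_correct, half_size):
    """The estimate is said to hold when 20-80 percent of students succeed."""
    return 0.20 <= difficulty(high_correct, low_correct, half_size) <= 0.80


def difference_is_acceptable(high_correct, low_correct, half_size):
    """At least .05 when more than 80 percent succeed; otherwise at least .10."""
    p = difficulty(high_correct, low_correct, half_size)
    minimum = 0.05 if p > 0.80 else 0.10
    # A small tolerance guards against floating-point rounding at the boundary.
    return high_low_difference(high_correct, low_correct, half_size) >= minimum - 1e-9


# Hypothetical class of 40: 14 of the high-scoring 20 and 12 of the
# low-scoring 20 answer correctly, a high-low difference of .10.
print(round(estimated_index(14, 12, 20), 2))
print(difference_is_acceptable(14, 12, 20))
```

With a difference of .10 the estimate is 3 × .10, which matches the first example in the note.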


An intermediate approach (intermediate between free-lancing and a
common departmental test) could be used. Such an approach (that is, the
use of an "anchor test") would allow for variations in course content
which may not only be legitimate but desirable; yet it would achieve some
of the advantages of a common test. Teachers can usually agree on at
least one-half of the items of their end-of-course examination, which could
then constitute an anchor test. The other half of the items could be selected
or devised by the individual teacher. A tabulation of scores for each
class on the anchor test would help teachers to modify their grading
distributions from class to class (in terms of any significant deviation between
a class distribution of anchor test scores and that for all students combined).
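One simple way to tabulate such deviations is sketched below. It is an illustration only: the text does not prescribe a particular statistic, so expressing each class mean as a distance from the combined mean in combined standard-deviation units is our assumption, as are the function name and the sample scores.

```python
from statistics import mean, pstdev


def anchor_deviations(class_scores):
    """Return each class's mean anchor-test score, expressed as a deviation
    from the mean of all students combined, in combined standard-deviation units."""
    combined = [score for scores in class_scores.values() for score in scores]
    overall_mean = mean(combined)
    overall_sd = pstdev(combined)
    if overall_sd == 0:
        # Every student earned the same score, so no class deviates.
        return {name: 0.0 for name in class_scores}
    return {
        name: (mean(scores) - overall_mean) / overall_sd
        for name, scores in class_scores.items()
    }


# Hypothetical anchor-test scores for two classes.
scores = {"Class A": [30, 32, 35, 28], "Class B": [22, 25, 27, 24]}
for name, deviation in anchor_deviations(scores).items():
    print(name, round(deviation, 2))
```

A class well above the combined mean on the anchor test might reasonably receive a more generous grade distribution than one well below it; how large a deviation counts as significant is left to the teachers' judgment.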

For a file of test items to be of maximum usefulness, each card should 
contain not only the item itself and the keyed answer but (1) classifica- 
tion of the item by content area and by objective and (2) the cumulated 
results of experience with the item. An illustrative test item card is shown 
in Figure 10.3. The notation in the upper left-hand corner indicates the
subject, and content area within the subject; while the number in the upper
right-hand corner indicates that the item involves "1.12 Knowledge of
Specific Facts," a subcategory of the taxonomy discussed in the next chapter.
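For a department that keeps its item file in electronic rather than card form, the same information might be recorded as in the sketch below. The class and field names are hypothetical, the keyed answer shown for the Texas item is our assumption (the card itself does not mark it), and the trial results are illustrative values.

```python
from dataclasses import dataclass, field


@dataclass
class Trial:
    """Results of one administration of the item."""
    difficulty: float       # proportion of students answering correctly
    discrimination: float   # estimated bi-serial r


@dataclass
class ItemCard:
    """One card in a departmental item file."""
    subject: str
    content_area: str
    objective: str          # taxonomy classification, e.g. "1.12"
    stem: str
    options: dict
    keyed_answer: str
    trials: list = field(default_factory=list)

    def record_trial(self, difficulty, discrimination):
        """Add the results of another class to the cumulated record."""
        self.trials.append(Trial(difficulty, discrimination))

    def mean_discrimination(self):
        """Average discrimination index over all recorded trials."""
        if not self.trials:
            return None
        return sum(t.discrimination for t in self.trials) / len(self.trials)


card = ItemCard(
    subject="U.S. History",
    content_area="Territorial Expansion",
    objective="1.12",
    stem="Texas became part of the United States by",
    options={
        "a": "purchase from France",
        "b": "purchase from Spain",
        "c": "treaty with Spain after the Spanish-American War",
        "d": "request of the people of Texas",
    },
    keyed_answer="d",  # assumed key; not marked on the card
)
card.record_trial(0.60, 0.42)  # illustrative values
card.record_trial(0.75, 0.36)
print(len(card.trials), round(card.mean_discrimination(), 2))
```

Keeping each trial as a separate record, rather than overwriting a single figure, preserves the cumulated experience with the item that the card is meant to provide.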


PROVIDING LEADERSHIP IN THE DEVELOPMENT OF 
TEACHER-MADE TESTS AND OTHER AIDS TO EVALUATION 


Since it is obvious that standardized tests can do only a portion of the 
job of assessing student growth, administrators and supervisors have the 
further responsibility of helping teachers learn to develop better teacher- 
made tests. 

Ebel has listed seven serious errors that teachers frequently make in
their evaluation of pupils’ educational attainments.

First, teachers tend to rely too much on their own subjective, but presumably
absolute, standards . . . the unreliability of subjective judgments have been
demonstrated over and over again. Yet few teachers have been persuaded to
use pooled judgments in cooperative test construction . . . or to recognize
their inevitable use of relative standards in evaluating student attainments.

Second, teachers tend to put off test preparation to the last minute. . . .

Third, many teachers use tests which are too poorly planned, too short, or
too inefficient . . . to sample adequately all the essential knowledge and abilities
in the area of educational attainment covered by the tests.

Fourth, teachers often place too much emphasis on trivial . . . details . . .
to the neglect of basic principles, understandings, and applications.

Fifth, teachers often write questions, both essay and objective, whose
effectiveness is lowered by ambiguity, or by irrelevant clues to the correct response.
Too seldom do they seek even one independent review of their questions by a
competent colleague.

Sixth, many teachers overlook, or underestimate, the magnitude of the
sampling errors which affect test scores. . . . Differences as small as one score
unit are considered to reflect significant differences in attainment.

Seventh, most teachers fail to test the effectiveness of their tests by even a
simple statistical analysis of the results from the test. . . . There is no better
way to develop skill in testing than to analyze systematically the results of
previous efforts.³¹


Preparation of teachers for their inescapable responsibility of evaluating 
student achievement should be begun at the preservice level. However, 
proficiency in the practical art of educational measurement requires con- 
siderable "learning by doing" on the job. 

Ebel has observed hundreds of teachers at work on committees charged 
with developing new tests for the Educational Testing Service; he has 
studied the reactions of these teachers and the test specialists who work 
with them. On the basis of his experience in seeing teachers grow in their 
interest and competence in test development, Ebel³² contends that highly
effective in-service education in measurement results when a group of 
teachers work together on (1) developing specifications for a test they 
all need, (2) writing items and revising each other's items, (3) preparing 
the test for use, (4) trying it out, and (5) analyzing the results. 

At several crucial points in their cooperative work (probably at inter- 
vals of 4—6 weeks), a specialist in test construction could be of considerable 
assistance to such a group of teachers. If the school district did not have 
such a specialist on its staff, periodic visits by a consultant would be a
justifiable expenditure. Teacher concern about the development of more 
adequate tests inevitably leads to concern about the appropriateness and 
clarity of objectives, the quality and relevance of educational experiences, 
and the effectiveness of the teaching process. Such concern is an essential 
prerequisite to improvement in curriculum and instruction. 

The Cooperative Test Division of Educational Testing Service has
prepared a series of color filmstrips, with synchronized sound, for use in a
teacher workshop on "Making Your Own Tests." Each filmstrip (running
25 minutes each) is focused on a major step in the process, for example:
"planning," "construction," and "analysis" of classroom tests. Included
in the kit are 28 ditto masters that summarize and amplify the major
points of each film.³³ Each school district can duplicate the number of
copies it needs for its own use. The Cooperative Test Division has also
produced five 15-minute, 16mm. sound films for use in training programs
for counselors and administrators (or with teachers if a qualified
discussion leader is present). The films are not suitable for use with lay
groups. The titles are “Selecting an Achievement Test,” “Administering a
Testing Program,” “Interpreting Test Results Realistically,” “Using Test
Results,” and “The Public Relations of Testing.” These films may be rented
or purchased.


31 Robert L. Ebel, “Improving the Competence of Teachers in Educational
Measurement,” The Clearing House, vol. 36 (October 1961), pp. 67-71.

32 Ibid., pp. 69-70.

33 Further information concerning this kit may be obtained from the Director of
Educational Relations, Cooperative Test Division, Educational Testing Service,
Princeton, N.J.


SUMMARY STATEMENT 


If a teacher-made test is to be used to grade students, or rank them with
respect to total score, the test should (1) be based on a representative sampling
of the content studied, (2) be based on a representative sampling of the abilities
or skills emphasized in the course, (3) contain a sufficient number of questions
so as to have adequate reliability, and (4) include items covering a wide range
of difficulty. If a test is designed to serve special instructional purposes rather
than to rank students on the basis of scores, the criteria listed above are less
applicable. For such tests, the following criteria are more significant: (1) Does
the test elicit from the students the desired type of mental processes? (2) Does
the test encourage the development of desirable study habits? (3) Does the test
lead to improved instructional practice? (4) Does the test foster wholesome
relationships between teachers and students?

Although essay examinations have the advantages of being easily prepared,
reducing the amount of guessing, and stimulating superior study methods, they
favor students with linguistic ability and have relatively low reliability and
validity. If skillfully constructed, however, they may give the student
opportunity to demonstrate his ability to select and organize his learnings and may
provide the teacher with a basis for diagnosing errors in interpretation and
concept formation.

The use of objective or selection-type test questions makes it possible to test
a large sampling of learnings in a relatively short testing time. This
extensiveness of sampling and the objectivity of scoring contribute to the relatively
higher reliability of the objective test. Objectivity of scoring also reduces scoring
time and introduces the possibility of test scoring by students or clerical
workers, or by a test-scoring machine. Specific suggestions for the construction
of each type of item were given, as well as a summary checklist for discovering
faulty test items and other errors in test construction.

Practical suggestions were given for editing test items, . . .
developing answer sheets, and making an item analysis of student responses to
test items.


SELECTED REFERENCES 


BEARD, RICHARD L., “Techniques the Teacher May Use in Constructing Tests,”
High School Journal, vol. 36 (January 1953), pp. 101-106.

BLOOM, BENJAMIN S., ed., Taxonomy of Educational Objectives, Handbook I:
Cognitive Domain. New York: David McKay Company, Inc., 1956.

CONRAD, HERBERT S., “The Experimental Tryout of Test Materials,” in E. F.
Lindquist, ed., Educational Measurement. Washington, D.C.: American
Council on Education, 1951, Chapter 8.

DAVIS, FREDERICK B., “Item Analysis in Relation to Educational and
Psychological Testing,” Psychological Bulletin, vol. 49 (March 1952), pp. 97-121.

DIEDERICH, PAUL B., “Making and Using Tests,” English Journal, vol. 44
(March 1955), pp. 135-140, 151.

EBEL, ROBERT L., “Procedures for the Analysis of Classroom Tests,”
Educational and Psychological Measurement, vol. 14 (Summer 1954), pp. 352-364.

ENGELHART, MAX D., “How Teachers Can Improve Their Tests,” Educational
and Psychological Measurement, vol. 4 (Summer 1944), pp. 109-124.

FRENCH, WILL, AND ASSOCIATES, Behavioral Goals of General Education in High
School. New York: Russell Sage Foundation, 1957.

FURST, EDWARD J., Constructing Evaluation Instruments. New York: David
McKay Company, Inc., 1958.

GERBERICH, J. RAYMOND, Specimen Objective Test Items. New York: David
McKay Company, Inc., 1956, Parts I, II, and III.

LENNON, ROGER T., “Testing: Bond or Barrier between Pupil and Teacher,”
Education, vol. 75 (September 1954), pp. 38-42.

LINDQUIST, E. F., “Preliminary Considerations in Objective Test Construction,”
Educational Measurement. Washington, D.C.: American Council on
Education, 1951, Chapter 5.

NOLL, VICTOR H., “Objectives as the Basis of All Good Measurement,”
Introduction to Educational Measurement. Boston: Houghton Mifflin Company,
1957, pp. 90-107.

VAUGHN, K. W., “Planning the Objective Test,” in E. F. Lindquist, ed.,
Educational Measurement. Washington, D.C.: American Council on Education,
1951, Chapter 6.

WEITZMAN, ELLIS, AND WALTER J. MCNAMARA, Constructing Classroom
Examinations. Chicago: Science Research Associates, Inc., 1949, Chapters 2-4.

WOOD, DOROTHY ADKINS, Test Construction: Development and Interpretation of
Achievement Tests. Columbus, Ohio: Charles E. Merrill Books, Inc., 1960.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Summarize the advantages and limitations of essay-type tests in your sub- 
ject field or grade level. 

2. Summarize the advantages and limitations of objective tests in your
subject field or grade level.

3. Construct an essay-type examination consisting of five questions, and
prepare an objective scoring key.

4. Prepare five true-false and five multiple-choice items on the contents of
this chapter, following the suggestions given for the construction of these types
of items.

5. Construct five matching-type items on the contents of this chapter,
following the suggestions given. Have the items evaluated by two or more of your
classmates.

6. Obtain an informal, objective teacher-made examination and evaluate it,
using the checklist of errors given in this chapter.

7. Construct an informal objective test of 50 items, and have it evaluated by
a classmate. Revise the items found to be faulty.

8. What are the advantages of maintaining a file of test items in your subject
field? Of working cooperatively with other teachers to maintain a departmental
file of this type? What facts should be recorded for each item?


The Taxonomy
of Educational Objectives
and Test Items Illustrative
of Its Major Categories


11


A great step forward in recognizing the complexity of educational
objectives and the difficulties involved in measuring student achievement was
taken when Bloom and several professional associates worked together
over a period of years in developing a taxonomy of educational objectives,
under which educational goals and test items in the cognitive areas could
be classified.¹

The major categories of this taxonomy, and each of its various
subcategories, are summarized in Table 11.1, together with illustrative objectives,
classified under each subcategory. A study of this table reveals that these
objectives are classified under six main headings, which represent
increasing degrees of complexity. The objectives in each major class make use of,
and are dependent on, the goals of the classes lower in the hierarchy; for
example, the cognitive abilities classifiable under “3.00 Application” are
dependent on the abilities classifiable under “2.00 Comprehension,” which
in turn are dependent on those under “1.00 Knowledge.” As an aid to
students in understanding the taxonomy, we have included in this table one
or more objectives of a course in measurement for each of the 20
subcategories.
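The hierarchical arrangement described above can be pictured as an ordered list in which each major class depends on all the classes before it. The sketch below is only an illustration; the function names are ours, and the simple "everything lower in the list" rule is a simplification of the dependence the taxonomy describes.

```python
# The six major categories of the cognitive domain, lowest to highest.
MAJOR_CATEGORIES = [
    ("1.00", "Knowledge"),
    ("2.00", "Comprehension"),
    ("3.00", "Application"),
    ("4.00", "Analysis"),
    ("5.00", "Synthesis"),
    ("6.00", "Evaluation"),
]


def major_category(code):
    """Map any classification code to its major heading, e.g. "1.12" -> "1.00"."""
    return code.split(".")[0] + ".00"


def prerequisites(code):
    """Names of the major categories lower in the hierarchy than the given code."""
    codes = [c for c, _ in MAJOR_CATEGORIES]
    position = codes.index(major_category(code))
    return [name for _, name in MAJOR_CATEGORIES[:position]]


print(prerequisites("3.00"))
print(major_category("1.12"))
```

Thus an item classified under Application is taken to draw on both Knowledge and Comprehension, while an item under Knowledge has no lower class to depend on.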


The six sections of this chapter are devoted to an explanation of each
of the six major categories of the taxonomy and to the presentation of
test items illustrative of each category. Not only will these test items serve
to illustrate the taxonomy but they will provide examples of item writing
for several subject fields. For additional suggestions for item writing in
various subject fields, the student is referred to a list of selected references,
classified by subject field, listed in the bibliography for this chapter.


1 Benjamin S. Bloom, Taxonomy of Educational Objectives, Handbook I: Cognitive
Domain (New York: David McKay Company, 1956). A similar volume
devoted to the development of a taxonomy for objectives in the affective domain is
nearing completion.


Table 11.1

Examples of Course-of-Study Objectives Classified under the Headings
of the Taxonomy of Educational Objectives: The Cognitive Domain


1.00 KNOWLEDGEᵃ (Remembering facts, terms, and principles in the form that
they were learned)

1.10 Knowledge of Specifics

  1.11 Knowledge of terminology
       Know the terms used in the study of elementary chemistry
       Know the terms used for different types of converted scores

  1.12 Knowledge of specific facts
       Know the physical and chemical properties of common elements
       Know the major sources of information on published tests

1.20 Knowledge of Ways and Means of Dealing with Specifics

  1.21 Knowledge of conventions
       Know the ways in which symbols are used in writing equations in
       chemistry
       Know the conventional procedures followed in tallying data in a
       frequency distribution

  1.22 Knowledge of trends and sequences
       Develop a basic knowledge of the evolutionary development of man
       Know the major trends in the development of achievement testing

  1.23 Knowledge of classifications and categories
       Know the major branches of biological science
       Know the major classifications of the "Taxonomy of Educational
       Objectives: Cognitive Domain"
       Know the four different types of number systems

  1.24 Knowledge of criteria
       Know the criteria by which the nutritive value of a meal can be
       judged
       Know the major criteria to be used in evaluating tests for a specific
       purpose

  1.25 Knowledge of methodology
       Know the methods for estimating the size of distant stars
       Know the methods used in obtaining norming samples that are
       representative of high school students

1.30 Knowledge of the Universals and Abstractions in a Field

  1.31 Knowledge of principles and generalizations
       Know the biological laws of reproduction and heredity
       Know the principles involved in comparing reliability coefficients
       computed on different groups


  1.32 Knowledge of theories and structures
       Know a relatively complete formulation of the theory of evolution
       Know a relatively complete formulation of the four types of validity
       and their relationships to the types of judgment to be made


2.00 COMPREHENSIONᵇ (Understanding material studied without necessarily
relating it to other material)

2.10 Translation (from one set of symbols to another)
     Can prepare graphical representations of physical phenomena, or of
     observed and recorded data
     Can translate normalized standard scores into percentiles
     Can read percentiles, stanines, or normalized standard scores from an
     Otis Normal Percentile Chart
     Can translate a formula into a verbal explanation or vice versa

2.20 Interpretation (summarization or explanation)
     Can distinguish among warranted, unwarranted, or contradicted
     conclusions drawn from a body of data
     Can read the section on Reliability in a test manual and summarize the
     data in his own words

2.30 Extrapolation (extension of trends beyond data given)
     Can interpolate when there are gaps in the data
     Can estimate the probable reliability coefficient for his own group on the
     basis of reliability data published in the test manual


3.00 APPLICATIONᶜ (Using generalizations or other abstractions appropriately in
concrete situations)
     Can predict the probable effect of a change in a factor on a biological situation
     previously at equilibrium
     Can select, on the basis of a table of intercorrelations, the best pair of tests
     for the prediction of success in a specific school or job situation


4.00 ANALYSISᵈ

4.10 Analysis of Elements
     Can distinguish a conclusion from the statements that support it
     Can distinguish between facts and interpretations in reading anecdotal
     records or other case study materials

4.20 Analysis of Relationships
     Can check the consistency of hypotheses with given information and
     assumptions
     Can recognize which methods of studying reliability are relevant to a
     particular use of a test


4.30 Analysis of Organizational Principles
     Can recognize the general structure of a musical composition
     Can recognize the techniques used in persuasive materials, such as
     advertising and propaganda
     Can recognize the differences among the four types of validity with
     respect to their emphasis on the criterion


5.00 SYNTHESISᵉ

5.10 Production of a Unique Communication
     Can write simple musical compositions
     Can tell a personal experience effectively
     Can devise test items classifiable under each major division of the taxonomy

5.20 Production of a Plan or Proposed Set of Operations
     Can propose experiments for testing hypotheses
     Can design simple machine tools to perform specified operations
     Can set up a table of specifications for a test on a specific unit of work
     Can devise a plan for the local validation of a test for a specific purpose

5.30 Derivation of a Set of Abstract Relations
     Can discover and formulate generalizations in mathematics
     Can discover and formulate an appropriate set of categories to use in
     summarizing student responses to a free-response question, such as “What
     do you like best about your school”


6.00 EVALUATIONᶠ (Judging the value of material for a specified purpose)

6.10 Judgments in Terms of Internal Evidence (for example, logical consistency)
     Can indicate logical fallacies in arguments
     Can evaluate the adequacy with which research data in a test manual
     support statements made about the value of the test for certain purposes

6.20 Judgments in Terms of External Criteria
     Can make a comparative appraisal of two or more tests for a specific
     purpose on the basis of the criteria presented in Part I of the text and
     consultation of expert opinion in the Buros Mental Measurements Yearbooks


Source: The headings of this outline and some of the illustrative objectives are taken
verbatim from Benjamin S. Bloom, ed., Taxonomy of Educational Objectives: Handbook I, Cognitive
Domain (New York: David McKay Company, Inc., 1956), Part II. Note that the taxonomy is
arranged in a hierarchy, that is, each classification within it utilizes the skills and abilities that
are lower in the classification order; for example, “Application” requires both “Comprehension”
and ability to recall “Knowledge.”

ᵃ In the taxonomy, the term “knowledge” is defined as including “those behaviors and test
situations that emphasize the remembering, either by recognition or recall, of ideas, material,
or phenomena. The behavior expected of the student in the recall situation is very similar to
the behavior he was expected to have during the original learning situation.” [Italics added.]
Ibid., p. 62.

ᵇ In the taxonomy, the term “comprehension” is used to represent “an understanding of the
literal message contained in a communication without necessarily relating it to other material.
Three types of comprehension behavior are considered. . . . The first is translation, which
means that an individual can put a communication into another language, into other terms, or
into another form of communication. . . . The second type of behavior is interpretation, which
involves dealing with a communication as a configuration of ideas whose comprehension may
require the reordering of ideas into a new configuration in the mind of the individual. . . .
The third type . . . is extrapolation. It includes the making of estimates or predictions based
on understanding of the trends, tendencies, or conditions described in the communication.”
Ibid., pp. 89-90.

ᶜ While comprehension requires that a student understand a generalization or other
abstraction well enough to illustrate its use, application requires that the student select the appropriate
generalization or concept and use it in a situation in which no mode of solution is suggested.
To test application, a test situation must either be novel or must contain novel elements as
compared with the situation in which the abstraction was learned. Ibid., p. 120.

ᵈ Analysis is defined as “the breakdown of the material into its constituent parts and
detection of the relationships of the parts and of the way they are organized. . . . Analysis . . .
may be divided into three types or levels. At one level, the student is expected to break down
the material into its constituent parts, to identify or classify the elements of the
communication. At a second level, he is required to make explicit the relationships among the elements
. . . their connections and interactions. A third level involves recognition of the organizational
principles, the . . . structure of . . . the communication as a whole.” Ibid., pp. 144-145.

ᵉ Synthesis is defined as “the putting together of elements and parts so as to form a
whole, . . . combining them in such a way as to constitute a pattern or structure not clearly
there before. Generally, this would involve a recombination of parts of previous experience
with new material, reconstructed into a new and more or less well-integrated whole.” Ibid.,
p. 162.

ᶠ Evaluation is defined as the making of conscious judgments about “the value, for some
purpose, of ideas, works, solutions, methods, materials, and the like. . . . For purposes of
classification, only those evaluations that are or can be made with distinct criteria in mind
are considered.” Ibid., pp. 185-186.


hips among the elements 


f parts of previous experience 


onscious judgments about "the value, for some 


materials, and the like. . . . For purposes of 
be made with distinct criteria in mind 


1.00 KNOWLEDGE 


"Knowledge," as defined in the taxonomy, includes those objectives and 
test situations that emphasize the remembering of facts and ideas. In a
knowledge item, the student is required either to recall or recognize what
he has learned in exactly, or almost exactly, the way he originally learned
it. The questions will usually be posed in a somewhat different form than
in the textbook, but the testing process is chiefly one of checking on the
completeness and accuracy of the student’s memory for material learned.

Test items that measure student knowledge outnumber all others in most
standardized achievement tests and most teacher-made tests. Gerberich²
assembled 227 test exercises illustrating every type of test item in current
use. His collection was far more varied than would be found in a random
sampling of standardized or teacher-made tests. Yet a later study of all
these items, in terms of the taxonomy, revealed that 51 percent were
classifiable in the knowledge category.³ Undoubtedly, one of the chief
reasons for the predominance of knowledge items is that they are
comparatively easy to construct.

Knowledge is classified under three headings: (1.10) Knowledge of 
Specifics, (1.20) Knowledge of Ways and Means of dealing with specifics, 
and (1.30) Knowledge of the Universals and Abstractions in a field. 


1.10 Knowledge of Specifics 


Here we are concerned with the student's memory of specific items of
information, which have meaning and value in themselves. They often
represent basic elements that the student must know if he is to become
acquainted with a field and solve problems within the field. “Knowledge of
Specifics” is divided into two subcategories: (1.11) Knowledge of
Terminology and (1.12) Knowledge of Specific Facts.


1.11 KNOWLEDGE OF TERMINOLOGY Each subject field utilizes a large
number of terms and symbols that constitute the “shorthand” used in
communication. A student is unable to think and work effectively in the field
unless he can make use of the most essential of these verbal and nonverbal
symbols.

Multiple-choice items can be used very effectively for questions on
terminology. For example, students may be asked to identify the names of
structures in actual animals. A frog may be dissected so that the heart is
exposed; a numbered string may be attached to each of the various heart
structures and the students asked to state the name and function of each
structure so labeled. The structures to be identified may be listed beside
the diagram, and the student asked to indicate his responses by placing
the correct letter or number after each structure in the list. The element of
guessing may be reduced by numbering more structures than are given in
the list.


2 J. Raymond Gerberich, Specimen Objective Test Items (New York: David McKay
Company, Inc., 1956).

3 Julian C. Stanley and Dale L. Bolton, book review of the “Taxonomy of
Educational Objectives,” Educational and Psychological Measurement, vol. 17 (Winter
1957).
HEART OF FROG (VENTRAL VIEW)⁴

[Figure: numbered diagram of the frog heart]

Structures:
a. Carotid artery
b. Left auricle 4
c. Post caval vein 8
d. Pulmocutaneous artery 10
e. Right auricle 9
f. Superior vena cava 7
g. Systemic artery 11
h. Truncus arteriosus 3
i. Ventricle


This type of exercise may also be used in a science to identify parts
of a machine, electrical circuit, and the like.

The more conventional type of matching exercise can also be used in
testing knowledge of terminology. A list of definitions is accompanied
by a list of terms (longer than the list of definitions, to reduce
guessing). The list of terms should include some that are likely to be confused
by the inadequately prepared student. The following example⁵ is illustrative:


187. The process in which electrons are gained in the outermost        A. Hydrogenation
     orbit of the atoms of the element.                                B. Ionization
188. The process in which the . . . to form water and the . . .         C. Neutralization
     form a salt.                                                      D. Oxidation
189. The process in which electrons are lost from the outermost        E. Reduction
     orbit of the atoms of the element.


Standardized tests in special subject fields include a number of items on
the understanding of terminology, such as the following examples from the
Anderson Chemistry Test.⁶


(1) The valence of an element tells
    (a) its atomic weight.
    (b) the solubility of its compounds.
    (c) its stability.
    (d) how many electrons its atom lends, borrows, or shares.
    (e) how many compounds can be formed.

(2) Any solution which conducts an electric current is called
    (a) an ion.              (d) an electrode.
    (b) an electrolyte.      (e) a catalyst.
4 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, The Construction and
Use of Achievement Examinations (Boston: Houghton Mifflin Company, 1936).

5 . . . “Evaluation of Achievement in Chemistry,” Journal of
Chemical Education, vol. 28 (July 1951).

6 Copyright 1950 by Harcourt, . . . Copyright in Great Britain. All rights
reserved. Reproduced by special permission.


The first item illustrates the technique of having the student choose the 
best definition for a given term in chemistry; in the second item, the defini- 
tion is given, and the student is merely required to match it with the correct 
term. Items of the first type ordinarily require a higher level of under- 
standing of the term or concept. Both types, however, require recognition 
only. For this reason, teachers may prefer to have students define essential 


terms in their own words, even though the scoring of such student defini- 
tions must be subjective. 


1.12 KNOWLEDGE OF SPECIFIC FACTS Under this heading can be in- 
cluded test items that involve knowledge of important dates, events, per- 
sons, and places which help the student in thinking about specific problems 
or topics. Facts differ from terminology in that terminology usually rep- 
resents the terms and symbols that authorities have agreed to use, whereas 
facts can be verified by means other than a consensus among workers in
the field. 

The term “specific” is used to differentiate facts that can be known 
as discrete elements from those that have meaning only in a larger context. 
In this sense, approximate information, such as the approximate time span 
covered by the Reconstruction Period, would be included. Knowledge 
about specific sources of information, such as the Buros Yearbooks for a 
student of measurement, would also be included in this category. 

An attempt should be made in selecting published tests or in develop- 
ing local tests, to measure knowledge of facts that are relevant to important 
concepts or principles. For example, the two questions below are relevant 


to the concept that there is strength in unity and that disunity weakens a
people in its defense against its enemies. 


1. One of the chief reasons for the fall of Greece was the fact that: 
a. the Greeks were not good fighters 
b. the city-states would not work and fight together 
c. Greece was a country which could be easily invaded 
d. the Greek cities had no forts or protecting walls 


2. Which one of the following was a reason for the white man's success in taking
territory from the Indians?

The Indians were:
a. not experienced fighters
b. often willing to fight on the side of the white men against other Indian tribes
c. not able to ride or shoot as fast as the white men
d. too few in number to fight the colonists successfully7


7 Reprinted with the permission of the California Test Bureau from Georgia Sachs
Adams and John A. Sexson, California Tests in Social and Related Sciences,
Elementary, Part I, Form AA (Monterey, Calif.: California Test Bureau, 1953), items
18, 50.


Taxonomy of Educational Objectives 371


Negatively stated questions can often be used to advantage in testing 
knowledge of specific facts. In some areas, it is extremely difficult to devise 
three or four plausible distractors. It may be much easier and more satis- 
factory to construct some negatively stated questions that will challenge the 
students. 


(1) All of the following secrete hormones EXCEPT the
(a) pituitary gland
(b) lymph nodes
(c) parathyroid glands
(d) adrenal glands
(e) islets of the pancreas8

(2) The only source of heat given below that may NOT produce carbon monoxide is
(a) gasoline
(b) electricity
(c) kerosene
(d) coal
(e) oil

(3) Four of the following were commonly included as qualifications for voting during the
colonial period. Which one was NOT?
(a) Male sex
(b) Ownership of property
(c) Membership in the state church
(d) Ability to read and write
(e) Status of "free man"10


1.20 Knowledge of Ways and Means of Dealing with Specifics 


In this subdivision of the taxonomy, we are concerned with the student's
knowledge of the ways in which man has learned to [...], study, and
criticize facts and ideas. This is [...] behaviors because we are focusing
on the knowledge of [...] rather than their actual application to specific
problems. The objectives and test items under this heading are classifiable
into five groups, as shown in Table 11.1.


1.21 KNOWLEDGE OF CONVENTIONS Under this subclass are included
the usages, styles, and practices that are agreed [...].
Examples would include the rules used in typing [...] bibliographies
to insure consistency of style, conventional rules of social [...],
and accepted usages in spoken and written English.


One of the basic problems in testing knowledge of such conventions as 


8 Nelson Biology Test, Form Bm, item 7. Copyright 1950 by Harcourt, Brace &
World, Inc., New York, N. Y. Copyright in Great Britain. All rights reserved.
Reproduced by special permission.

9 Read General Science Test, Form Bm, item 2. Copyright 1950 by Harcourt,
Brace & World, Inc., New York, N. Y. Copyright in Great Britain. All rights
reserved. Reproduced by special permission.

10 Reprinted by permission of the Educational Testing Service from Cooperative
American Government Test, Form X, item 22 (Princeton, N.J.: Educational
Testing Service, 1947).



Social trends 


(1) Which statement best describes the change in the power of state and federal
governments during the last twenty-five years?
(a) No important increase or decrease of power is noticeable.
(b) Both state and federal powers have decreased.
(c) State power has increased; federal power has decreased.
(d) Both have increased in power, but the power of the states has increased more
rapidly.
(e) Both have increased in power, but the power of the federal government has
increased more rapidly.14


(2) On your answer sheet, mark the number of your answer 
(1) if higher in 1950 than in 1900; 
(2) if lower in 1950 than in 1900. 
62. Percent of people over 65 years of age. 
63. Death rate (deaths per 1000 population). 
64. Divorce rate (divorces per 1000 population). 
65. Percent of people living in cities.15 


Cause-effect relationships 
(1) Three of the following were causes of the conflict between England and Spain in
the New World. One was a result. Which was the result?
(a) growth of foreign trade of both England and Spain
(b) supremacy of England on the sea
(c) English slave trade with Spanish colonies
(d) capture of Spanish galleons by Sir Francis Drake16

(2) What conditions contributed to the economic depression of the early 1930s? Choose
a, b, c, d, or e.
(1) The lack of farm prosperity in the 1920s.
(2) The decline of foreign markets after World War I.
(3) The lack of purchasing power of low-income groups.
(4) The large military budgets of the 1920s.
(5) The lack of industrial capacity and natural resources.


[Answer choices a through e are not legible.]


14 Reprinted by permission of the Educational Testing Service from Cooperative
American Government Test, Form X, item 58 (Princeton, N.J.: Educational
Testing Service, 1947).

15 Dimond-Pflieger Problems of Democracy Test, Form Am, items 62-65. Copyright
1953 by Harcourt, Brace & World, Inc., New York, N. Y. Copyright in Great
Britain. All rights reserved. Reproduced by special permission.

16 Reprinted with the permission of The California Test Bureau from California
Tests in Social and Related Sciences, Advanced, Form AA, Part I, item 19
(Monterey, Calif.: California Test Bureau, 1954).

17 Crary American History Test, Form Am, item 88. Copyright 1950 by Harcourt,
Brace & World, Inc., New York, N. Y. Copyright in Great Britain. All rights
reserved. Reproduced by special permission.




1.23 KNOWLEDGE OF CLASSIFICATIONS AND CATEGORIES As a subject 
field becomes well developed, scholars develop classifications and categories 
that may seem arbitrary to the beginning student but that are fundamental 
to further work and study in the field. 

Recall questions might be concerned with knowledge of the types of 
reliability coefficients, the subdivisions of the Dewey Decimal system, the 
headings of the taxonomy, the various classifications of jobs within the 
engineering profession or some other occupation, or the logical
subdivisions within the biological sciences.

Test items that require the student to classify unfamiliar phenomena 
into categories would not fall in this subdivision. For example, an item 
testing the student's recall of the headings of the taxonomy would be classi- 
fiable here; whereas one requiring the classification of test items according 
to the taxonomy would fall under 3.00 Application, to be discussed later. 

The following three examples illustrate the type of objective items that
would be classifiable under this heading:

(1) In all fairly complex animals the skeleton and the muscles are developed from the
primary germ layer known as the
1. ectoderm
2. neurocoele
3. epithelium
4. endoderm
5. mesoderm

(2) Which of the following is a chemical change?
1. Evaporation of alcohol
2. Freezing of water
3. Burning of oil
4. Melting of wax
5. Mixing of sand and sugar18

(3) A triode radio tube differs from a diode tube in that it has:
(a) a [...]
(b) a cathode
(c) a shield
(d) a grid
(e) an inert gas within the tube19


In each case, the student is required to recall the knowledge as taught, or
to recognize a textbook interpretation, rather than to apply the system of
categories in a new situation.

1.24 KNOWLEDGE OF CRITERIA This subclass is similar to the preceding
one in that (1) it involves knowledge found useful by specialists in the field
and (2) it includes only knowledge of criteria rather than their application
to new problems.

18 Benjamin S. Bloom, ed., Taxonomy of Educational Objectives, Handbook I:
Cognitive Domain (New York: David McKay Company, Inc., 1956), p. 83.

19 Dunning Physics Test, Form Am, item 75. Copyright 1950 by Harcourt, Brace
& World, Inc., New York, N. Y. Copyright in Great Britain. All rights reserved.
Reproduced by special permission.




Since considerable emphasis has been placed in this textbook on knowledge
of criteria for test selection, two illustrative items are given from this
subject area.


1. What is characteristic of a highly reliable test?
(A) Two scores for the same examinee agree closely with each other.
(B) Good students make much higher scores on the test than poor students.
(C) There is a uniform distribution of scores on the test.
(D) The items in the test vary widely in difficulty.

2. In general, low reliability is associated with a
(A) relatively large error of measurement
(B) relatively small error of measurement
(C) below norm performance
(D) non-representative norms
(E) relatively high validity20


1.25 KNOWLEDGE OF METHODOLOGY Knowledge concerning the procedures
employed in a subject field is an important prerequisite to the use
of such methodology in studying new problems. Before engaging in any
type of inquiry, the student should know the techniques that have been
effectively used for the study of similar problems.

Again examples from the measurement area are used.

(1) In the scoring of essay examinations, all the following are generally considered
desirable practices except to
(A) reduce the mark for poor spelling or penmanship
(B) prepare a scoring key and standards in advance
(C) remove or cover pupils' names from the papers
(D) score one question on all papers before going to the next
(E) use the same standards for all pupils21

(2) Describe one method of making an item analysis which can be used to study both
item difficulty and item discrimination.


1.30 Knowledge of the Universals and Abstractions in a Field 


Ideally, a student's knowledge of a field is not limited to specifics and 
ways of dealing with them, but goes beyond this to a knowledge of the 
major ideas and patterns by which these facts and ideas are organized. A 


20 Reprinted with the permission of the authors, Robert L. Ebel and Eric F.
Gardner, Multiple-Choice Items for a Test of Teacher
Measurement (Ames, Iowa: National Council on Measurement in Education, 1962),
p. 23.

21 Reprinted by permission of the publisher
Saupe, Manual for Introduction to Educational Measurement (Boston: Houghton
Mifflin Company, 1957).




student who can recall many abstractions in a subject field, and can recall 
specific illustrations of them that have been studied in class, has the basis 
for relating and organizing a great many specifics. As a result, he tends to 
gain greater insight into large units of subject matter and to show greater 
retentiveness for both the generalizations and the supporting facts.

In this category of objectives and test items, we have only two major 
subclassifications: (1.31) knowledge of principles and generalizations and 
(1.32) knowledge of theories and structures. 


1.31 KNOWLEDGE OF PRINCIPLES AND GENERALIZATIONS Here we are 
concerned with abstractions that help us to describe, explain, or predict 
phenomena. The test items are ones that require (1) that the student know
the principles and generalizations in the sense that he can recognize or
recall correct versions of them, or (2) that the student recall or recognize
specific illustrations of these generalizations that have been used in the
textbook, or other instructional materials.

Examples:


(1) When a gas is heated at constant volume, its pressure increases because
(A) its molecules increase in size although their speed remains constant
(B) the average change in momentum per molecule hitting the walls of the container
increases
(C) Charles' law is true
(D) its molecules increase in both size and speed
(E) clusters of molecules break apart and fly about separately22

(2) The Constitution of the United States provided long terms for federal judges in order
to
(A) render it impossible for them to serve in other positions in the federal government
(B) eliminate expenses that accrue with frequent elections
(C) facilitate the rendering of decisions without political influence
(D) make the federal bench an attractive career service
(E) secure continuity in the administration of justice23


1.32 KNOWLEDGE OF THEORIES AND STRUCTURES This subcategory
represents the highest level of abstraction of the entire Knowledge category.
It differs from category 1.31 in that it involves knowledge of a body of
interrelated principles and generalizations. In order to perform successfully
on items in this classification, the student must have a clear and systematic
overview of some theory, such as the theory of evolution, or of some
structure, such as the structural organization of the city, state, or Federal
government.

22 A Description of the College Board Achievement Tests (Princeton, N. J.: The
College Entrance Examination Board, 1962), p. 142.

23 Ibid., p. 108.




completion type, which require students to fill in specified data read from a 
graph or table, such as the mileage between two towns, the name of a capi- 
tal city, and the like. 


Below is given a map of a make-believe continent. There are five countries in this
continent, numbered 1, 2, 3, 4, and 5. Read each question below, and then use the
map to answer it. For each question, mark the answer as you have been told.

76. Which one of the following cities is the capital of Country 4?  D  E  F
77. Which one of the following cities is the largest?  [...]
78. Which city is slightly northeast of City H?  K  B  D
79. Which one of the following cities is farthest from the equator?  H  E  F
80. The distance from City C to City H is about  100 miles  50 miles  75 miles
81. Between which two countries does a river form part of the boundary?
    1 and 2  1 and 3  2 and 5

The locations of certain physical features have been indicated on the map by small
letters. Find each physical feature, then mark the answer as you have been told.

82. Lake  b  x  h
83. Delta  s  t  f
84. Isthmus  a  e  k
85. Source of stream  f  g  m

Fig. 11.1 An Illustrative Test Item Involving Skill in Translation.

Reproduced with the permission of the California Test Bureau from
Georgia Sachs Adams and John A. Sexson, California Tests in Social and
Related Sciences, Elementary, Form AA (Monterey, Calif.: California Test
Bureau, 1955).


A third type of translation is from one verbal form to another, as in
translating a communication from a foreign language to English, or from
a poetic or dramatic form (with its symbolism and metaphors) to everyday
prose.




2.20 Interpretation and 2.30 Extrapolation 


The essential behavior in Interpretation is the comprehension of the 
major ideas in a communication and an understanding of their interrela- 
tionships. Interpretation goes beyond translating parts of a table, map, or 
other communication to a determination of the larger, more general ideas 
that can be drawn from it. One of the questions on the map, no. 79, went 
beyond translation in that the student was asked to consider the relation- 
ship between the various parts, to get a view of the map as a whole, and 
relate it to his own fund of concepts. That is, "interpretation" was involved 
when we asked the student “Which one of the cities is farthest from the 
equator?" 

In the following algebra problems, the student must comprehend not
only the elements but their interrelationships in order to work them
successfully.

(1) The area of a certain square is represented by the expression 9x² + 6xy + y².
What will be the perimeter P of this square expressed in terms of x and y?27

(2) If h, k, m, and n are positive numbers, k is greater than m, and n is greater than
h, which of the following is (are) true?
I. n + h may equal k + m.
II. k + h may equal n + m.
III. k + n may equal m + h.
(A) None (B) I only (C) I and II only (D) I and III only (E) I, II, and III28
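The interrelationships that these two items depend on can be written out as a short worked sketch. It is offered only as an illustration of the reasoning and is not taken from the cited tests; it assumes that 3x + y is positive, so that it can serve as the side of the square.

```latex
% Item (1): the area is a perfect square, so the side s and perimeter P follow.
\begin{align*}
A &= 9x^2 + 6xy + y^2 = (3x + y)^2, \\
s &= \sqrt{A} = 3x + y, \\
P &= 4s = 12x + 4y.
\end{align*}

% Item (2): adding the two given inequalities rules out statement III.
\begin{align*}
k > m \ \text{and}\ n > h \;\Longrightarrow\; k + n > m + h .
\end{align*}
```

For statements I and II of the second item, one set of values satisfying the given conditions, h = 1, k = 3, m = 1, n = 3, gives n + h = k + m = 4 and k + h = n + m = 4. Both equalities are therefore possible, while III never is.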


Competency in extrapolation, or making inferences from trends or
relationships in the data, requires a recognition that the inference involves
some degree of probability. Extrapolation involves extending from a
sample to a universe, from one situation to similar situations, from a trend in
the past to a prediction for the future.

It is often possible to test for competency in extrapolation by devising
items on the same map, graph, or table that was used to test the
interpretation objectives. If the exercise is an objective one, we can include some
generalizations that involve a sound extension of evident trends and others
that clearly involve overgeneralization.

In interpretation-of-data tests, the student can be given a
selection, table, chart, or map and asked to supply or recognize inferences
that can legitimately be made from the data. [...] inferences
should be based on the joint consideration of two or more
elements in the communication.


27 Hawkes, Lindquist, and Mann, op. cit., p. 367.

28 A Description of the College Board Achievement Tests, op. cit., p. 131.




Directions: Below are some statistics relating to education and occupations. You are to 
judge what conclusions may be drawn from them. 


                                                      Occupational
                                                      distribution found   Distribution of
                                                      in a sample of       occupations in the
                                                      male college         population as a
                                                      graduates*           whole, 1950
OCCUPATIONS                                                      PERCENTAGES
Executives, minor officials, partners, proprietors         23.5            [...]
Professional workers                                       51.3             4.7
Salesmen                                                    6.0            Less than 1%
Skilled workers                                             7.1            33.8
Clerical workers                                            8.7            13.4
Unskilled workers                                           0.7            26.1
Farmers                                                     2.7            13.0
                                                          100.0           100.0

* You may assume that the sample selected is representative of all male college
graduates in the United States.


Below are a series of statements relating to occupations and education.
Blacken answer space

A—if the foregoing statistics alone are sufficient to prove the statement true;

B—if the foregoing statistics alone are sufficient to indicate that the statement is
probably true;

C—if the foregoing statistics alone are not sufficient to indicate whether there is
any degree of truth or falsity in the statement;

D—if the foregoing statistics alone are sufficient to indicate that the statement is
probably false;

E—if the foregoing statistics alone are sufficient to prove the statement false.
30. Typically farmers are completely uneducated.

31. The professions absorb a larger percentage of male college graduates than any
other group in the country.

32. Sons of unskilled workers and sons of farmers have an approximately equal chance
to go to college.

33. Educational opportunity for the lower classes is increasing.

34. The same proportions of farmers and of unskilled workers are college graduates.29


Items of this type should be used as test questions only when students
have already been given class exercises in the use of these terms.
Otherwise the student may use responses (B) and (D) to indicate his doubt
about his own answer, rather than his doubt concerning the adequacy of
supporting data.


29 Benjamin S. Bloom, op. cit., p. 110.

By means of test exercises of this type, the teacher can obtain evidence
concerning (1) the student's ability to recognize the truth or falsity of
generalizations that are clearly supported or negated by the data, (2) his
ability to identify inferences for which the correct response is "insufficient
data," and (3) his tendency to "go beyond the data" or to be overcautious
in extrapolating from the data to infer trends.30
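The error categories described in footnote 30 can be stated as a small scoring routine. What follows is a minimal sketch under stated assumptions, not the stencil-key procedure itself: it assumes the five-point A to E scale of the exercise above with C ("insufficient data") as the center, and the function names and the sample key and answers are hypothetical.

```python
from collections import Counter

# Assumed five-point scale: A = proved true ... C = insufficient data ... E = proved false.
SCALE = "ABCDE"
CENTER = SCALE.index("C")


def classify_response(keyed: str, marked: str) -> str:
    """Classify one answer against the keyed response (hypothetical helper)."""
    k = SCALE.index(keyed) - CENTER    # signed distance of the key from the center
    m = SCALE.index(marked) - CENTER   # signed distance of the student's mark
    if m == k:
        return "correct"
    if k * m < 0:
        # Opposite side of the center from the keyed response.
        return "crude error"
    if abs(m) > abs(k):
        # Wrong in the direction of the extremes of the scale.
        return "beyond the data"
    # Wrong in the direction of the center of the scale.
    return "overcaution"


def score_sheet(key: str, responses: str) -> Counter:
    """Tally the four categories over a whole answer sheet."""
    return Counter(classify_response(k, r) for k, r in zip(key, responses))


# Hypothetical key and answer sheet, for illustration only.
tally = score_sheet("BCDAE", "ACBAC")
print(tally)
```

With the hypothetical sheet shown, the tally is two correct answers and one error of each kind. Treating a mark of C on an item keyed away from C as overcaution, and any non-C mark on an item keyed C as going beyond the data, is this sketch's reading of the footnote.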

The following question illustrates how we can test the student's ability 
to extrapolate by making inferences regarding the probable point of view 
of a historical leader, a political party, or some other type of leader or 
group. 


"What choices, then, are left us in the realm of foreign policy? I see only two:
imperialistic adventuring and the active promotion of world peace, and which of these
alternatives is likely to supply the more favorable conditions for the continuance of
constitutional democracy among us is hardly open to reasonable doubt. Even wars
fought for the most generous ends can still spell disaster for that complex set of values
which our Constitution aims to uphold and promote."


The words most nearly reflect the sentiments of 


(A) George Washington
(B) Abraham Lincoln
(C) Woodrow Wilson
(D) Theodore Roosevelt
(E) Thomas Jefferson31


This question tests both knowledge and extrapolation. Extrapolation
items in the interpretation-of-data exercises provide a purer measure of
competency in extrapolation by providing the data from which inferences
can be made.


3.00 APPLICATION 


The chief difference between the categories of Comprehension and
Application is that the latter involves facing a new problem, or a problem that
appears quite unfamiliar until the student has restructured the elements
into a familiar context. Comprehending an abstraction does not guarantee
that the individual will be able to recognize its relevance and apply it
correctly in real-life situations. Students need practice in restructuring
unfamiliar problem situations and applying the concepts and principles they
have learned.

30 By the use of a stencil scoring key, the position of each correct answer can be
indicated on each student's answer sheet. Scores for "going beyond the data" or
overgeneralizing can be obtained by counting incorrect answers that are in the
direction of the extremes of the scale, while the "over caution" score is obtained by
counting incorrect answers in the direction of the center of the scale. "Crude errors"
are errors that traverse the center of the scale, that is, are on the opposite side of
center from the keyed response.

31 A Description of the College Board Achievement Tests, op. cit., p. 111.

As our technological world has become more complex and more rapidly 
changing, the application of learnings to new problems has become even 
more important. Hence, the effectiveness of education cannot be adequately 
appraised unless we find out how well students can apply what they have 
learned in situations that differ from the textbook situations in which the 
concepts and principles were originally studied. 

The test items presented here as representing the Application category 
might involve only knowledge or comprehension if the specific problem 
presented in the test item had been studied in the textbook or in classroom 
discussions. In order to test Application, a test situation must be new to 
the student or contain new elements that require rethinking or restructuring 
of the material learned. In our attempts to set up new situations, three 
approaches are useful: (1) presenting a fictional situation, (2) using ma- 
terial with which students are not likely to have had contact (such as sim- 
plified versions of complex problems studied in more advanced work), and 
(3) taking a new slant on common situations.32

In some Application items, such as the following, the process by which
the student reaches the solution of the proposed problem is not shown.

(1) If the earth were viewed from the moon, which one of the following statements
would be true? The earth would
a. appear black because it generates no light.
b. show phases similar to those we see on the moon.
c. appear about the same size as the moon does to us.
d. eclipse the sun during a considerable part of each month.

In other items, practically the entire process of choosing the correct
principles and applying them in the solution of the problem is recorded.
The following essay item is an example:


[...] cleaned a ten-gallon glass tank [...] washed sand. He rooted several stalks
of weed (elodea) taken from a pool and then filled the aquarium with tap water. After
waiting a week, he stocked the aquarium with ten one-inch goldfish and three snails.
The aquarium was then left in a corner of the room.


What prediction, if any, can be made concerning the condition of the aquarium after 
a period of several months? 


32 Benjamin S. Bloom, op. cit., p. 130.

33 Reprinted by permission of The California Test Bureau from Georgia Sachs
Adams et al., California Tests in Social and Related Sciences, Advanced, Form AA
(Monterey, Calif.: California Test Bureau, 1954), Test 5, item 7.




If you believe a definite prediction can be made, make it and then give your reasons. 


If you are unable to make a prediction for any reason, indicate why you are unable
to make a prediction (give your reasons).34


In the following objective item, the entire process of selecting a correct 
or incorrect conclusion and supporting the conclusion by relevant or irrele- 
vant reasons is clearly recorded. 


An electric iron (110 volts, 1000 watts) has been used for some time and the plug
contacts have become burned, thus introducing additional resistance. How will this 
affect the amount of heat which the iron produces? 


Directions: Choose the conclusion which you believe is most consistent with the facts 
given above and most reasonable in the light of whatever knowledge you may have, 
and mark the appropriate space on the Answer Sheet. 


Conclusions: 


A. The iron will produce more heat than when new. 
B. The iron will produce the same heat as when new. 
C. The iron will produce less heat than when new. 


Directions: Choose the reasons you would use to explain or support your conclusion
and fill in the appropriate spaces on your Answer Sheet. Be sure that your marks are
in one column only—the same column in which you marked the conclusion.

Reasons:

1. The heat produced by an electrical device is always measured by its power rating.
It is independent of any contact resistance.

2. Electric currents of the same voltage always produce the same amount of heat, and
burned contacts do not decrease the amount of electricity entering the iron.

3. The current which flows through the iron is reduced when the resistance is increased.

4. Increasing the resistance in an electrical circuit increases the current.

5. An increase in electrical resistance increases the heat developed.

6. Manufacturers of electric irons urge that the contacts be kept clean to maintain
maximum efficiency.

7. An increase in the temperature of a wire usually results in an increase in its resistance.

8. Burned contacts increase the heat developed in an electric iron just as increasing the
friction in automobile brakes develops more heat.

9. The heat developed by an electric iron when connected to 110 volts is independent
of the flow of current.35
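The physical reasoning that this item calls for can be checked with a short calculation. The sketch below is an illustration only, under simplifying assumptions that are not stated in the item: the line voltage stays at 110 volts, the heating element keeps the resistance implied by its rating, and the burned contacts act as a resistance in series with it. The function name and the one-ohm contact value are hypothetical.

```python
def iron_heat_watts(volts: float, rated_watts: float, contact_ohms: float) -> float:
    """Heat developed in the iron's element, with contact resistance in series.

    Assumptions (labeled, not from the test item): constant line voltage and
    a fixed element resistance derived from the rating, R = V**2 / P.
    """
    element_ohms = volts ** 2 / rated_watts            # 12.1 ohms for 110 V, 1000 W
    current = volts / (element_ohms + contact_ohms)    # Ohm's law for the series circuit
    return current ** 2 * element_ohms                 # power dissipated in the element


new_iron = iron_heat_watts(110, 1000, 0.0)    # clean contacts: the rated 1000 watts
worn_iron = iron_heat_watts(110, 1000, 1.0)   # hypothetical 1-ohm burned contact
print(round(new_iron, 1), round(worn_iron, 1))
```

Under these assumptions a one-ohm contact resistance lowers the heat developed in the iron from 1000 watts to about 853 watts, because the added resistance reduces the current through the element.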


In most cases, we administer Application items in order to evaluate the
effectiveness of instruction and the growth of students toward course
objectives. Hence, we are not interested in the extent to which students can
solve a problem by common sense or on the basis of common knowledge;
rather, we are interested in the extent to which the student has learned to
apply the concepts and generalizations taught in a specific course. One
must therefore guard against the inclusion of clues to the solution of a
problem that would be helpful to the bright student without specialized
knowledge. Perhaps the best safeguard against items that evaluate general
problem-solving ability is to administer them to persons who equal our
students in intelligence but have not taken the specific course for which
the exercise was designed.

34 Adapted from PEA Test 1.3 B, "Application of Principles in Science,"
Evaluation in the Eight-Year Study, cited by The Measurement of Understanding,
Forty-fifth Yearbook, Part I (Chicago: National Society for the Study of Education,
1946), p. 111.

35 Problem VI from PEA Test 1.3, "Application of Principles," cited by Benjamin
S. Bloom, op. cit., p. 132.


4.00 ANALYSIS 


Skill in analysis is included as an objective in many subject fields. Teachers 
of science and social studies want their students to be able to distinguish 
facts from hypotheses in both written and spoken communications, distin- 
guish major from subordinate ideas, and recognize when unstated assump- 
tions are involved in reaching a conclusion. Teachers of music and 
literature want students to be able to distinguish dominant and subordinate 
themes, to find evidence of a composer's or author's techniques and 
purposes. 

Analysis implies breaking down a communication into its parts, plus
seeing the way in which the parts are organized in relationship to each
other. At the lowest level of analysis, the student is expected to identify
and classify the elements of the communication, for example, differentiating
between hypotheses and conclusions, or recognizing the unstated
assumptions being made by the author. At the second level, analysis is concerned
with the study of relationships among the elements of a communication, or
among the various parts of a document; for example, the relevance of
supporting points to a central idea, or the relationship of a hypothesis to
evidence presented in support of it. The third and highest level of analysis
involves the analysis of organizational principles. An author or composer
rarely points out the organizational principles he has used in developing a
speech, a poem, a play, or a symphony. Yet the reader or listener may
not achieve full understanding of a communication until he has discerned
a speaker's underlying point of view, identified the techniques the artist is
using, or recognized in a play or sonnet its underlying structure or pattern.

Test items in this field should present the student with new material,
rather than material that has already been analyzed in the text or in class
discussions. The material presented for analysis may be an unfamiliar
selection from literature, a report of a new experiment, a description of a
hypothetical social situation, or an unfamiliar picture or musical composition.
The test items may be of either the essay or objective type. When
objective-type exercises are to be constructed, it is best to first administer parallel
essay items and analyze student responses to them. As a result of such
analysis the distractors or wrong alternatives in the objective test can
represent errors in analysis that students actually make.

In a standardized test of critical thinking, the following exercise on
"Inference" is included.36


Directions: 


T if you think the inference is definitely TRUE; that it properly follows beyond a 
reasonable doubt from the statement of facts given. 

PT if, in the light of the facts given, you think the inference is PROBABLY TRUE; 
that there is better than an even chance that it is true. 

ID if you decide that there are INSUFFICIENT DATA, that you cannot tell from the
facts given whether the inference is likely to be true or false; if the facts provide 
no basis for judging one way or the other. 

PF if, in the light of the facts given, you think the inference is PROBABLY FALSE; 
that there is better than an even chance that it is false. 

F if you think the inference is definitely FALSE; that it is wrong, either because it 
misinterprets the facts given, or because it contradicts the facts or necessary in- 
ferences from those facts. 


In 1946 the United States Armed Forces conducted an experiment called “Operation 
Snowdrop” to find out what kinds of military men seemed to function best under severe 
Arctic climatic conditions. Some of the men selected came from Northern European stock 
while others came from Latin or Mediterranean stock; some were stout and some were 
thin; some were draftees and some volunteers; some had normal blood pressure while 
some had slightly high or low blood pressure. All of the participants in "Operation
Snowdrop” were given a training course in how to survive and function in extreme
cold. At the conclusion of the experiment it was found that the only two factors among
those studied which distinguished between men whose observable performance was rated
as "effective" and those rated as "ineffective" on the Arctic maneuvers were: (1) desire
to go (volunteer versus draftee), and (2) degree of knowledge and skill regarding
how to live and protect oneself under Arctic conditions. 


11. Despite the training course given to all of the participants in "Operation
Snowdrop," some exhibited greater Arctic survival knowledge or skill than others.

12. The Armed Forces expected that important future military operations might be
carried on in the Arctic.

13. A majority of the men who participated in "Operation Snowdrop" thoroughly
disliked that experience.

14. As a group, the men of Nordic backgrounds were found able to withstand the cold
and to function more effectively than those of Latin backgrounds.

15. Participants who were normal in weight and blood pressure were found much better
than other participants at acquiring skills to protect themselves under Arctic
conditions.

36 Watson-Glaser Critical Thinking Appraisal, Revised Form Zm, items 11-15.
Copyright 1961 by Harcourt, Brace & World, Inc., New York, N. Y. Copyright in
Great Britain. All rights reserved. Reproduced by special permission.




Test items on analysis of relationships are especially suitable for open- 
book tests. The student can show his ability to check the relevance of points 
made by the author to his central idea. 

The following item is based on an excerpt from Lindsay's The Modern
Democratic State.37


The relation between the definition of sovereignty given in Paragraph 2 and that given
in Paragraph 9 is best expressed as follows: 
1. There is no fundamental difference between them, only a difference in formulation. 
2. The definition given in Paragraph 2 includes that given in Paragraph 9, but in addi- 
tion includes situations which are excluded by that given in Paragraph 9. 
3. The definition given in Paragraph 9 includes that given in Paragraph 2, but in addi- 
tion includes situations which are excluded by that given in Paragraph 2. 
4. The two definitions are incompatible with each other; the conditions of sovereignty 
implied in each exclude the other.38

The following items are based on a musical composition, which is played 
for the students. 
(1) The general structure of the composition is 
1. theme and variations. 
2. theme, development, restatement. 
3. theme 1, development; theme 2, development. 
4. introduction, theme, development. 
(2) The theme is carried essentially by 
1. the strings. 
2. the woodwinds. 
3. the horns. 
4. all in turn.39


5.00 SYNTHESIS 


Synthesis involves the student's combining elements or parts in such a way
as to constitute a pattern or structure that is new to him. As a rule, new
experiences or materials are combined with those previously learned into
a new integration. This category is the one that most clearly provides for
creative activity on the part of students.

Not all essay questions belong under Synthesis. In many essay questions,
the student is required only to translate what he recalls of the textbook
statement into his own words. Or an essay question may simply require the
student to analyze the structural elements in a sonnet, or to recognize
the structure of a musical composition.

37 Alexander D. Lindsay, The Modern Democratic State (New York: Oxford
University Press, 1947).
38 Benjamin S. Bloom, op. cit., p. 159.
39 Ibid., p. 161.

Three subcategories under Synthesis are distinguished on the basis of 
the products of the creative restructuring process: (1) production of a 
unique communication, (2) production of a plan, or proposed set of oper- 
ations, and (3) derivation of a set of abstract relations. It is assumed that 
these three fairly distinct kinds of products require somewhat different 
cognitive processes. 

In a test item or assignment requiring synthesis the student is allowed 
considerable latitude with respect to the content of his communication and 
hence can draw freely upon his own ideas, feelings, and experiences. Yet 
he is not allowed completely free expression because the task is structured 
to show how much the student has grown with respect to such synthesis
objectives as


Ability to make an extemporaneous speech 

Ability to write an informative essay 

Ability to write a short story (or a poem) that others would find interesting and 
entertaining 

Ability to set a short poem to music. 


When the Portland, Oregon, schools decided to select their most talented
students in various fields, they developed five exercises in creative writing
that could be classified as “unique communications.” Two of the five exercises
are briefly described.


Developing Expressive Sentences 

After the children had had some preparatory work in this type of exercise, 
the children were given several sentences, for example, "The man went down 
the street.” The children were asked: “In what way could you add to or change 
the word ‘man’ to give a clearer picture of the man? In what ways could
you change other words in the sentence to make us see this man going down
the street?”


Developing a Paragraph from a Sentence

After the children had had some preliminary experience with similar assignments,
they were asked to choose one sentence from a group of suggested sentences and
write a paragraph about it. Sentences which would stimulate the imagination of
children were used, such as: “The mysterious box drew all eyes to it.”40


40 Adapted from Robert C. Wilson, “Improving Criteria for Complex Mental
Processes,” Invitational Conference on Testing Problems (Princeton, N.J.: Educa- 
tional Testing Service, 1957), pp. 14-17. 




The following two sets of directions vary with respect to the amount of 
structuring of the exercise. The first permits much more freedom in choice 
of content than does the second. 


(1) “Think of some time in your own life when you were up against a difficulty, something
that stood in your way and had to be overcome. Make up a story around this
difficulty and tell it to the class."


(2) "Think of a plot based upon an obstacle that could occur between the following 
two sentences, and then develop a short story using these sentences and your plot." 


It was an event to be honored with a party, preferably a surprise party... "It
was a surprise, all right—a surprise all the way around!"41


These examples have all been concerned with creative writing or ex- 
temporaneous speaking. However, similar test situations can easily be de- 
vised in art, music, creative dance, or creative dramatics. 

In an exercise involving the development of a "plan or proposed set of 
operations," the requirements for the student's product are usually pre- 
sented in the form of specifications to be met or data to be considered. 
The student is encouraged to develop his own approach to the problem. 
For example, a science student might be asked to propose ways of testing 
specific hypotheses. A student in teacher education might be asked to plan 
a unit of instruction to achieve specified objectives at a specified grade
level; however, he would be encouraged to use his own ideas concerning
specific content and activities.


Although such a 
sociology, the a 
ganizing his lea 

Any exercise 
ing, or commercial art) that re 


41 Benjamin S. Bloom, op. cit., p. 178.
42 Ibid., pp. 180-181.

The student who observes many related phenomena and formulates an
hypothesis that adequately accounts for them is performing the third type
of synthesis task. For example, the pupil who formulates for himself state- 
ments about the relationships between corresponding parts of similar 
triangles or makes similar discoveries in mathematics or science would be 
making progress toward a synthesis type of objective. 

In testing for progress toward synthesis objectives, we need to help stu- 
dents feel free from pressure to conform to the views or preferred methods 
of the teacher. Too much control and too many instructions will stifle 
students’ creativity. Enough time must also be allowed for the student to 
become thoroughly acquainted with an unfamiliar task, to explore possible 
approaches, and to reach a synthesis that seems best for him. Sometimes 
the time problem can be partially met by allowing the students to do some 
of their preparation before the examination period. Special reading mate- 
rials that would help in the writing of an essay, for example, can be 
distributed and studied ahead of time. 

The evaluation of the products of synthesis presents another problem in
that objective criteria of worth are often lacking. The independent judg-
ments of qualified persons are perhaps the only basis for evaluation of 
many of these products. The problem of improving the validity and relia- 
bility of such subjective judgments is considered further in Chapter 12. 
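
One common way to check how far independent judgments agree is to correlate the ratings that two judges give to the same set of products. The sketch below is an illustration under stated assumptions, not the procedure of Chapter 12: the two lists of ratings are invented, the five-point scale is hypothetical, and the Pearson product-moment coefficient is used simply as a familiar index of agreement.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings (1 = poor, 5 = excellent) given independently by
# two judges to the same six student essays.
judge_a = [4, 3, 5, 2, 4, 1]
judge_b = [5, 3, 4, 2, 4, 2]

r = pearson_r(judge_a, judge_b)
print(round(r, 2))
```

A coefficient near 1.0 suggests that the judges rank the products in much the same order; a low coefficient suggests that the criteria need to be clarified before the ratings are used.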


6.00 EVALUATION 


Any person makes innumerable evaluations daily as he judges persons,
objects, or activities as being more or less useful to him, more or less
attractive, or as either enhancing or threatening to his status or self-
esteem. Many of these evaluations are highly egocentric and quickly made
without careful consideration. In the taxonomy, these are considered to be
opinions, while the term "evaluation" is reserved for those evaluative
judgments that are, or can be, consciously made with distinct criteria in
mind. They usually require fairly adequate comprehension and analysis as
a basis for judgment. The student can readily see how an evaluation of a
standardized test in terms of the criteria presented in this textbook would
differ from an opinion that the XYZ test, used by a highly regarded school
district, would be a good one for local use.

Evaluation is placed last among the categories because it is a complex
process that involves some combination of all the other behaviors. It can
be defined as the making of judgments about the value, for some purpose,
of an idea, method, solution, or product. The criteria used in making value
judgments in a test exercise may either be given to the student or determined
by him.




(B) if the line is inappropriate in rhythm or meter, 
(C) if the line is inappropriate in style or tone, 
(D) if the line is inappropriate in meaning. 


Hail, bards triumphant! born in happier days, 
Immortal heirs of universal praise! 
Whose honors with increase of ages grow 


56. Like saplings stuck in dirt long years ago.

57. As streams roll down, enlarging as they flow. 

58. As streams in desert lands where hot winds blow. 

59. From obscurity to fame the world around doth know.44 


The following essay question about a poem represents a type of item
that could easily be adapted to the evaluation of many different products.


Essay items give the student an opportunity to demonstrate his competence,
but they do not focus sharply on the desired behaviors. Bloom deplores
the shortage of good objective items in the evaluation category. He
concludes the taxonomy with this final statement: “Perhaps the greatest
value of the taxonomy . . . is in pointing to the need for further study and
development of testing techniques for measuring competence in evaluating
documents, materials, and works.”46


SUMMARY STATEMENT


[...] adequate basis for making [...] achievement in a course,
all course objectives should be represented in the table of specifications for
the test and should be giv[...]s in the design and selec[...]
[...] taxonomy, which provides a basis
for classifying both objectives and test items, can aid materially in increasing
[...] as well as standardized ones.


44 A Description of the College Board Achievement Tests, op. cit., p. 36.
45 Benjamin S. Bloom, op. cit., p. 198.
46 Ibid., p. 195.


The concepts involved in the taxonomy are complex and interrelated ones. 
In order to help students to understand and use the taxonomy, the author has 
included in Table 11.1 illustrative objectives for each of the 20 subcategories. 
In the text of this chapter, an attempt has been made to define and illustrate 
each subcategory and to clarify distinctions between them. In providing test
items illustrative of each subcategory, the author has attempted to include
items that are desirable models of item writing and ones that represent a large
number of subject areas. In this way, the specimen items can serve the dual
purpose of illustrating the classifications of the taxonomy and of providing
additional samples of item writing in mathematics, science, history, foreign 
language, and other subject areas. 


SELECTED REFERENCES 


BLOOM, BENJAMIN S., ed., Taxonomy of Educational Objectives, Handbook I:
Cognitive Domain. New York: David McKay Company, Inc., 1956. 
COOK, DESMOND L., “The Use of Free Response Data in Writing Choice-Type
Items,” Journal of Experimental Education, vol. 27 (December 1958),
pp. 125-133.

CURETON, EDWARD E., “The Rearrangement Test,” Educational and Psychological
Measurement, vol. 20 (Spring 1960), pp. 31-35.

EBEL, ROBERT L., “Writing the Test Item,” in E. F. Lindquist, ed., Educational 
Measurement. Washington, D.C.: American Council on Education, 1951, 
Chapter 7. 

ENGELHART, MAX D., “Suggestions for Writing Achievement Exercises to be 
Used in Tests Scored on the Electric Scoring Machine,” Educational and 
Psychological Measurement, vol. 7 (Autumn 1947), pp. 357-374. 

GERBERICH, J. RAYMOND, Specimen Objective Test Items: A Guide to Achieve-
ment Test Construction. New York: David McKay Company, Inc., 1956. 

GRAHAM, G., Teachers Can Construct Better Achievement Tests. Curriculum 
Bulletin No. 170. Eugene, Ore.: University of Oregon, 1956. 

HAWKES, HERBERT E., E. F. LINDQUIST, AND C. R. MANN, The Construction and 
Use of Achievement Examinations. Boston: Houghton Mifflin Company, 
1936, Chapter 7.

The Measurement of Understanding, 45th Yearbook, National Society for the 
Study of Education, Part I. Chicago: University of Chicago Press, 1946. 

NEDELSKY, LEO, “Ability to Avoid Gross Error as a Measure of Achievement,” 
Educational and Psychological Measurement, vol. 14 (Autumn 1954), pp. 
459-472. 

REINER, WILLIAM B., "Evaluating Ability to Recognize Degrees of Cause and
Effect Relationships," Science Education, vol. 34 (February 1950), pp. [...]-28.

SMITH, E. R., R. W. TYLER, AND OTHERS, Appraising and Recording Student
Progress. New York: Harper & Row, Publishers, Inc., 1942. 

WEITZMAN, ELLIS, AND WALTER J. MCNAMARA, "Apt Use of the Inept Choice 
in Multiple Choice Testings," Journal of Educational Research, vol. 39 
(March 1946), pp. 517-522. 

WOOD, DOROTHY ADKINS, Test Construction: Development and Interpretation
of Achievement Tests. Columbus, O.: Charles E. Merrill Books, Inc., 1960. 


English and Speech 


BRANDENBURG, ERNEST, AND PHILIP A. NEAL, "Graphic Techniques for Evaluating
Discussion and Conference Procedures," Quarterly Journal of
Speech, vol. 39 (April 1953), pp. 201-208.

CROWELL, LAURA, "Rating Scales as Diagnostic Instruments in Discussion,"
Speech Teacher, vol. 2 (January 1953), pp. 26-32.

DIEDERICH, PAUL B., "Making and Using Tests," English Journal, vol. 44 (March
1955), pp. 135-140, 151.

DIEDERICH, PAUL B., "Self-Correcting Homework in English," in Helen Huus, ed.,
Education: Intellectual, Moral, Physical. Philadelphia: University of
Pennsylvania Press, 1960, pp. 258-271.

DIEDERICH, PAUL B., "Testing in the New English Program," English Record,
vol. 3 (Spring 1953), pp. 11-17.

DRESSEL, PAUL L., AND L. B. MAYHEW, Handbook for Theme Analysis. Dubuque,
Iowa: William C. Brown Co., 1954.

GATES, ARTHUR I., A List of Spelling Difficulties in 3,876 Words. New York:
Bureau of Publications, Teachers College, Columbia University, 1937.

HARRIS, CHESTER W., "Measurement of Comprehension of Literature," School
Review, vol. 56 (May, June 1948), pp. 280-289, 332-342.

HUDDLESTON, EDITH, "Measurement of Writing Ability at the College-Entrance
Level: Objective vs. Subjective Testing Techniques," Journal of Experimental
Education, vol. 22 (March 1954), pp. 165-213.

PALMER, OSMOND E., "Evaluation of Communication Skills," in Paul L. Dressel
and Associates, Evaluation in Higher Education. Boston: Houghton Mifflin
Company, 1961, pp. 192-226.

SMITH, DORA V., ed., The English Language Arts, Commission on the English
Curriculum, National Council of Teachers of English, Curriculum Series,
vol. 1. New York: Appleton-Century-Crofts, 1952, Chapter 18.

SWEARINGEN, MILDRED E., "Evaluation in the Language Arts Program," Children
and the Language Arts. Englewood Cliffs, N.J.: Prentice-Hall, Inc.,
1955, Chapter 20.

THOMAS, EDNA S., Evaluating Student Themes. Madison, Wisc.: University of
Wisconsin Press, 1955.

THOMAS, MACKLIN, "Construction Shift Exercises in Objective Form," Educational
and Psychological Measurement, vol. 16 (Summer 1956), pp. 181-186.

VORDENBERG, WESLEY, "How Valid Are Objective English Tests?" English Journal,
vol. 41 (October 1952), pp. 428-429.


Foreign Language 


AGARD, FREDERICK B., AND HAROLD B. DUNKEL, An Investigation of
Second-Language Teaching. Boston: Ginn and Company, 1948.

CORNELIUS, EDWIN T., JR., Language Teaching: A Guide for Teachers of
Foreign Languages. New York: Thomas Y. Crowell Company, 1954.

MANUEL, HERSCHEL T., “The Use of Parallel Tests in the Study of Foreign 
Language Teaching,” Educational and Psychological Measurement, vol. 13 
(Autumn 1953), pp. 431—436. 

PIMSLEUR, P., "French Speaking Proficiency Test,” French Review, vol. 34 
(April 1961), pp. 470-479. 




PIMSLEUR, P., AND OTHERS, “Foreign Language Learning Ability," Journal of 
Educational Psychology, vol. 53 (February 1962), pp. 15—26. 

RAYMOND, JOSEPH, “A Controlled Association Exercise in Spanish," Modern 
Language Journal, vol. 35 (April 1951), pp. 281-291. 

SADNAVITCH, J. M., AND W. L. POPHAM, "Measurement of Spanish Achievement 
in the Elementary School," Modern Language Journal, vol. 45 (November 
1961), pp. 297-299. 

SCHENK, ETHEL А., Studies of Testing and Teaching in Modern Foreign Lan- 
guages. Madison, Wisc.: Dembar Publications, 1952. 


Mathematics 


BROWN, CLAUDE H., The Teaching of Secondary Mathematics. New York: 
Harper & Row, Publishers, Inc., 1953, Chapter II. 

CLARK, JOHN R., ed., Emerging Practices in Mathematics Education. 22d Yearbook,
National Council of Teachers of Mathematics. Washington, D.C.:
The Council, 1954, Part 5.

DONOVAN, JOHNSON, WITH OTHERS, "The Evaluation of Mathematical Learn- 
ing," Emerging Practices in Mathematics Education. 22d Yearbook, 
National Council of Teachers of Mathematics. Washington, D.C.: The 
Council, 1950, Part 5, pp. 339-409. 

FAWCETT, HAROLD P., ed., The Nature of Proof. 13th Yearbook, National Council
of Teachers of Mathematics. Washington, D.C.: The Council, 1941.

MYERS, SHELDON S., Published Evaluation Materials in Mathematics. Annotated
Bibliography of Mathematics Tests. Princeton, N.J.: Educational Testing 
Service, 1961. Reprinted from Evaluation in Mathematics, Washington, 
D.C.: The National Council of Teachers of Mathematics, 1961. 

PICKETT, HALE, An Analysis of Proofs and Solutions of Exercises on Plane 
Geometry Tests, Contributions to Education, No. 747. New York: Bureau 
of Publications, Teachers College, Columbia University, 1938. 

REEVE, W. P., “Evaluation Program in Secondary Mathematics," School Science
and Mathematics, vol. 55 (February, March 1955), pp. 123-140, 216-228.

SIMPSON, M. H., “Mathematics Teachers and Self-Evaluation Procedures,”
Mathematics Teacher, vol. 56 (April 1963), pp. 238-244.

SPACHE, GEORGE, “A Test of Abilities in Arithmetic Reasoning," Elementary 
School Journal, vol. 47 (April 1947), pp. 442-445. 


Reading 


BLOMMERS, PAUL, AND E. F. LINDQUIST, “Rate of Comprehension of Reading;
Its Measurement and Its Relation to Comprehension,” Journal of Educa- 
tional Psychology, vol. 35 (November 1944), pp. 449-473. 

BRYAN, MIRIAM, “Can We Really Measure Reading Comprehension?—A Testing
View,” The Journal of the Reading Specialist, vol. 2 (September
1962), pp. 4-5.

HUSBANDS, K. L., AND J. HARLAN SHORES, “Measurement of Reading for Problem
Solving: A Critical Review of the Literature,” Journal of Educational
Research, vol. 43 (February 1950), pp. 453-465.

NASLAND, R. A., AND OTHERS, “Evaluation and the Reading Program,” Clare- 
mont Colleges Reading Conference Yearbook, 1961, pp. 133-141. 




5. If tests of ability to apply principles were given annually to high school 
students of science, what influence could be expected on the instructional pro- 
gram in that area? 

6. Prepare an exercise in which a student is asked to suggest or criticize 
procedures for testing an hypothesis in some area of science. 

7. Select from standardized science tests а number of exercises that test 
science understandings and the ability to use the scientific method, rather than 
the memorization of isolated facts. 

8. Members of a high school science faculty met to review a list of science
objectives and to plan their evaluation program in terms of these objectives.
One teacher seemed to express the feeling of the group when he said that the
job seemed overwhelming. What practicable plans can you suggest to this group
of teachers?


12 Evaluating Student Performance in the Skills


Even a casual comparison of a sampling of available tests with a typical 
list of educational goals would convince the reader that the evaluation of 
student performance in the skills has been grossly neglected. Almost all 
subjects have a number of important skills outcomes, such as laboratory
skills in the sciences and the skills of handwriting, speaking, and effective
writing in English instruction. Moreover, in fine arts, industrial arts, homemaking,
physical education, and vocational arts, the student’s performance
in various skills may assume even greater importance than his knowledge
outcomes.

Because knowledges can be easily and efficiently measured by paper-and-pencil
tests, teachers have rationalized their inattention to skills outcomes
by assuming that there is a high relationship between knowledge and performance.
Even examinations for admission to the bar, to teaching, and
to medical practice have been largely verbal tests of examinees’ knowledge
of facts and principles.
Knowledge is necessary, but not sufficient, to adequate performance
in the skills. Knowledge of traffic rules, although important, is no guarantee
of ability to drive; knowledge of rules in athletics does not correlate highly
with performance, nor knowledge of recipes and nutrition with ability to
cook. “From the standpoint of validity one of the most serious errors committed
in the field of human measurement has been that which assumes the
high correlation of knowledge of facts and principles on the one hand and
performance on the other."1

1 David G. Ryans and Norman Frederiksen, "Performance Tests of Educational
Achievement," in E. F. Lindquist, Educational Measurement (Washington, D.C.:
American Council on Education), p. 455.

Very little has been done to measure performance in process and the
products of performance, since (1) in such measurement it is difficult to
obtain an adequate sampling of skills, and (2) it is difficult to evaluate 
attainment in many skills with even fair objectivity. However, the respon- 
sibility to measure student attainment is inescapable. If course objectives 
include the development of skills, other than those of a verbal nature, the
use of performance tests and/or fairly objective ratings of products and
processes is essential to effective teaching and learning.


In Chapter 5, tests were classified on the basis of the degree to which
they directly measured criterion behavior. The most direct type was the
work sample or “identical elements” test; the next type of test, involving
some indirectness of measurement, was the "related behavior type." Most
performance achievement tests are classifiable under one of these two types,
that is,

1. The work sample type, in which the examinee is given a special opportunity
under standard conditions to do some of the tasks on which we want to
appraise his competency, such as sewing, cooking, or driving a car.

2. The simulated situation type (classifiable under "related behavior"), in
which the examinee works in a test situation specially designed to be similar
to the usual situation and to elicit the kinds of behavior we wish to measure
(for example, students in a sewing class cut out the various pieces of a
miniature dress pattern and pin them to a sheet of colored paper which
represents dress goods).


Both these types of tests might be further classified into (a) those in which
objective scoring is possible because there is a clear-cut distinction between
rightness and wrongness (as in typewriting or mechanical assembly) and
(b) those in which the scoring must depend on the judgment of the observer
(as in instrumental or vocal performance, automobile driving, and
the like).


DEVELOPING TESTS OF SKILLS OUTCOMES 


As the teacher faces the task of evaluating student growth in manipulative
and other physical skills, he recognizes the low validity of paper-and-pencil
tests and the subjectivity of his daily observations. He may, therefore,
consider the mastery of certain skills to be so important that they justify
the development and use of performance tests. Such tests are especially
valuable as a basis for diagnosis and reteaching. The results aid the instructor
in assessing his own effectiveness in demonstrating certain skills;
they reveal to him the points upon which greater emphasis should be
placed.




Before we can score a student's performance or product, we have to 
select the specimens of behavior to be evaluated and plan the standard 
conditions under which they are to be obtained. Our procedures in setting 
up performance achievement tests will be similar to those described in 
Table 4.2 on content validity. That is, our basic approach will be (1) to 
define the universe of skills to be sampled and (2) to sample that universe 
by procedures that can be clearly described. As in other types of achieve- 
ment tests, professional judgment will usually be required in the sampling 
process because it is seldom possible or efficient to use a random sampling 
of all the skills in a course or unit. 

The universe of skills to be sampled is usually defined in a list of objec- 
tives for the course (which indicates the skills in which students should 
become proficient). The validity of the test will depend largely on the tasks 
selected to represent these general skills. If the sampling of skills to be 
included in the test is well done, coaching for the test should improve the 
student in the general abilities tested.

The following guidelines for selecting tasks for a skills test should improve
test validity and the efficiency of measurement (per unit of testing
time).


1. Choose tasks that are representative of the significant skills emphasized in
the course.

2. Choose tasks, or aspects of tasks, that are reasonably difficult for students.
Since performance testing is time consuming, eliminate tasks, or parts of
tasks, that almost everyone can do. (For example, in a performance test of
cooking skills, a student would fry bacon rather than potatoes; in a test
of driving skills, a student would do parallel rather than angle parking.)

3. Plan a test that involves a minimum of repetition of identical procedures
(for example, a test of driving skills should be planned so that the student
spends little time repeating right turns and other routine tasks).

4. Choose tasks that are crucial to success on the job as a whole. The test
should provide opportunities to make those mistakes that are frequently
responsible for failure in the total task or for failure to progress to higher
levels of proficiency. For example, a bandmaster would want to test students
for accurate music reading and ability to come in at the proper place after
a rest; a swimming teacher would place considerable weight on the student's
ability to synchronize his breathing with the pattern of his crawl stroke, as
well as those characteristics of swimming form that minimize water resistance.

5. If feasible, choose tasks that do not require too much time to perform, so
that one can include a larger sampling of different tasks. (For example, a
student of statistics might complete several problems in which the easier,
time-consuming work had been done for him; or a student of instrumental
music might play selected passages from several different compositions.)

6. Choose tasks in which the conditions of work can be made standard for all
students and the performance can be judged with considerable objectivity.
For example, the student's skill in frying eggs (see Figure 12.3) could
probably be judged more objectively than his skill in frying potatoes. The
student's skill in backing a car down a marked lane can be judged more
objectively than his ability in parallel parking. In the latter task, moreover,
the observer must make allowance for the fact that cars differ in length and
in ease of manipulation.

7. If possible, choose tasks that involve materials and equipment commonly
used in the course, and of which there are sufficient sets available to permit 
a number of students to be tested at one time. 
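
Several of the guidelines above can be applied mechanically once the candidate tasks have been listed. The following sketch is only an illustration under stated assumptions: the `Task` fields, the 0.9 cutoff, the ten-minute budget, and the sample driving tasks (including the invented "hill start") are hypothetical, and professional judgment would still be needed for representativeness and for tasks crucial to success.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    skill: str           # the general skill the task is meant to sample
    prop_success: float  # assumed share of students who can already do it
    minutes: int         # assumed testing time per student


def select_tasks(tasks, time_budget, max_prop_success=0.9):
    """Pick tasks for a performance test within a fixed time budget."""
    # Guideline 2: drop tasks that almost everyone can do.
    candidates = [t for t in tasks if t.prop_success <= max_prop_success]
    # Guideline 5: try short tasks first so more skills can be sampled.
    candidates.sort(key=lambda t: t.minutes)
    chosen, covered, used = [], set(), 0
    for task in candidates:
        # Guideline 3: one task per skill keeps repetition to a minimum.
        if task.skill in covered or used + task.minutes > time_budget:
            continue
        chosen.append(task)
        covered.add(task.skill)
        used += task.minutes
    return chosen


# Hypothetical pool of driving tasks with assumed difficulty and time values.
tasks = [
    Task("angle parking", "parking", 0.95, 3),
    Task("parallel parking", "parking", 0.60, 5),
    Task("backing down a marked lane", "backing", 0.70, 4),
    Task("right turn", "turning", 0.97, 1),
    Task("hill start", "starting", 0.50, 6),
]

selected = select_tasks(tasks, time_budget=10)
print([t.name for t in selected])
```

With these invented figures the easy tasks (angle parking, right turn) are dropped, and the ten-minute budget admits the two shortest remaining tasks, each sampling a different skill.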


In any performance test, it is important to standardize conditions of 
work. Students must work under similar conditions if comparison is to be 
possible. In comparing students with respect to physical education skills, 
the position in the court from which the basketball is thrown and other 
specifications must be clearly indicated. In comparing shorthand students 
with respect to their skill in taking dictation, recorded dictation materials 
at specified dictation speeds should be used. When students’ speeches or 
essays are compared, they should speak or write under similar conditions 
with respect to previous preparation, time allowed, type of subject assigned,2
and the like. The teacher who grades students in their skills solely
on the basis of products made at home may be comparing the products 
of a student who has received no help with those of another who has
received the aid of a proficient parent and has been able to use special tools
not available to all. 

The following set of directions for a test of skill in laboratory techniques
establishes standard working conditions and minimizes questions from
students.


On the table you will find a supply of frogs and the materials needed to make a
dissection.


1. Remove the skin from the frog's hind leg. 
2. Dissect the gastrocnemius and the tibialis anticus longus muscles of the lower leg free 
from other muscles but left attached to the bones, showing their origin and insertion. 


As soon as you have finished, notify the laboratory instructor so that the condition
of the dissection can be scored before it has deteriorated.


You will be allowed exactly twenty minutes to make this dissection.3


2 Since the student's performance tends to vary with his interest and experience
on an assigned topic, several samples on a variety of topics should be obtained.

3 Herbert E. Hawkes, E. F. Lindquist, and C. R. Mann, The Construction and
Use of Achievement Examinations (Boston: Houghton Mifflin Company, 1936), pp.
253-254.


Evaluating Student Performance in the Skills 405 


difficult to construct and to administer; and (4) the fact that such tests tend 
to penalize students who cannot work well under pressure. 

A test of laboratory techniques in college physics, in which 18 students 
could be tested at the same time, was developed by Kruglak. He set up 
18 performance items, or stations, with a total of 35 possible responses. 
The verbal description of the given apparatus and the problem is typed 
on a 4- by 6-inch card, with each card being taped to the laboratory table 
next to the apparatus involved. At the beginning of the period, students 
are assigned to their initial stations at random. At three-minute intervals, 
as signaled by the teacher, each student moves to successive stations. Another
test, designed to measure student competency in more complex techniques,
involves six stations to which students are assigned for nine minutes
each.4
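The rotation plan described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function name, the seed, and the rule that each student simply moves to the next numbered station are hypothetical, since the text says only that students start at random and move to successive stations on the teacher's signal.

```python
import random


def station_schedule(n_students=18, n_stations=18, minutes_per_station=3, seed=None):
    """Return a list of (start_minute, positions) pairs.

    positions[s] is the station occupied by student s during that interval.
    Assumes n_students <= n_stations, so that no station is ever shared.
    """
    if n_students > n_stations:
        raise ValueError("more students than stations")
    rng = random.Random(seed)
    start = list(range(n_stations))
    rng.shuffle(start)  # random initial station for each student
    schedule = []
    for interval in range(n_stations):
        begins_at = interval * minutes_per_station
        # Every student advances one station per interval, wrapping around.
        positions = [(start[s] + interval) % n_stations for s in range(n_students)]
        schedule.append((begins_at, positions))
    return schedule


schedule = station_schedule(seed=1)
total_minutes = len(schedule) * 3
```

With 18 stations at three minutes each, the whole class is tested in 54 minutes of working time; the six-station test at nine minutes per station occupies the same total.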


SCORING PROCESSES AND PRODUCTS


Performance in process may be scored in terms of speed, use of approved 
methods, or general quality of the performance. Speed (of running, swim- 
ming, typewriting, writing of shorthand characters, and many other skills) 
is important and can be objectively measured. Notations with respect to 
the use of approved methods are of considerable value in diagnosis and
reteaching. The accuracy and quality of a process are usually judged in
terms of their effect on the product. When this is impossible, as in vocal 
and instrumental music or physical skills, the subjective judgment of com- 
petent observers is used. 


Relative Advantages of Scoring Processes and Products 


The scoring of student performance in work samples or simulated-situation
tests may be based on (1) the performance in process, (2) the
product of the performance, or both. In some cases, the product is not
distinguishable from the process, as in instrumental music, speech, or oral
work in foreign language. In other cases, one can make a distinction, such
as between the process of driving and the result (destination reached or
distance traversed). In the case of driving, however, the process is all
important; and subjective judgment concerning the process, made by competent
observers, is indispensable. Fortunately, in many cases, the product


4 H. Kruglak, “Experimental Outcomes of Laboratory Instruction in Elementary
College Physics,” American Journal of Physics, vol. 20 (January 1952), pp. 138-




is most important; in the composing of music, for example, analysis of the 
process is less important than evaluation of the product. 

In such subjects as typewriting, handwriting, cooking, and many others, 
both product and process can be evaluated. In such cases, we prefer to 
evaluate the products. Evaluation of products tends to be more reliable 
than evaluation of processes for a number of reasons: 


1. More time is available for judging products, while performance in process 
must be judged “on the wing.” 


2. Independent judgments of products, made by different evaluators, can be 
obtained, checked for interscorer reliability, and combined. 


3. We can develop a scale of products (representing approximately equal differences
in quality) to aid in objective scoring.


4. We can train persons in the use of product scales; we can check their reliability
in grading and the extent to which their judgments agree with those
of adjudged experts.


In some situations, early errors can irrevocably influence a product; that is,
unless the product is evaluated at different stages, we can score a product
too low because of early errors (for example, an irremediable error in
cutting a garment).
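The checking and combining of independent judgments mentioned in the list above can be illustrated briefly. The sketch below assumes that interscorer reliability is estimated by the product-moment correlation between two judges' scores and that the judgments are combined by simple averaging; the scores themselves are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores given by two judges to the same six products.
judge_a = [7, 5, 9, 4, 6, 8]
judge_b = [6, 5, 9, 3, 7, 8]

reliability = pearson_r(judge_a, judge_b)                   # agreement between judges
combined = [(a + b) / 2 for a, b in zip(judge_a, judge_b)]  # pooled score per product
```

A low correlation would suggest that the judges are attending to different characteristics and that the scoring scale or the training of the judges needs attention before the scores are pooled.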


Ranking Processes and Products 


When a teacher assigns grades to students in such performance skills
as swimming or playing tennis, or when he grades their products (whether
they are essays or pies), he is presumably ranking the performances or
products on some continuum of quality. Too often the characteristics used
as a basis for grading differ from teacher to teacher, and also differ as the
same teacher observes and evaluates the work of different students. One
advantage of using checklists and rating scales, discussed in the next chapter
section, is to minimize these differences with respect to selective attention
and recall.

Another problem in grading is differences in generosity (from teacher
to teacher, and from time to time with the same teacher). If all teachers
rank students’ work, differences in generosity do not affect students’
scores; no teacher can place an unusually large number in the top 10 percent;
whereas he could be very generous in assigning A’s. Moreover, the
ranking process requires the teacher to make a more careful study of
interindividual differences than is usually considered necessary in the
assignment of marks.

Diederich’s suggestion for the sorting of essays into nine groups, as a
basis for assigning stanine scores, is a less arduous procedure than placing



all products in rank order. The following plan has been used by assistants 
or readers who have been trained to grade themes for teachers. 


The readers first sort the papers into five piles in order of merit, with 10%, 
30%, 20%, 30%, and 10% of the papers in each pile from low to high. Then 
they take the piles above and below the mean and sort them again in the ratio 
of two papers to three. Thus the first pile of 10% becomes two piles with 4% 
at the very bottom and 6% slightly better. The next pile of 30% becomes two 
piles with 12% worse papers and 18% better ones. The same proportions are 
observed for the two piles above the mean, so that we come out with nine 
piles with 4%, 6%, 12%, 18%, 20%, 18%, 12%, 6%, and 4% in ascending 
order of merit . . . a very slight rounding error [has been made] at two points
in the scale to make the proportions easy to remember and compute, but they
are extremely close to the true stanine proportions.5


This procedure could be used with any other products that could be sorted,
such as drawings, blueprints, maps, photographs, or small work samples
made in industrial arts or home economics.
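The arithmetic of the two-stage sort can be checked with a short sketch. The first-sort percentages and the two-to-three split come from the plan quoted above; the list of "true" stanine percentages is the commonly cited set and is supplied here as an assumption, as are the function and variable names.

```python
# Percent of papers in each pile after the first sort, low to high (from the quoted plan).
FIRST_SORT = [10, 30, 20, 30, 10]

# Commonly cited theoretical stanine percentages (assumed; not given in the quotation).
TRUE_STANINES = [4, 7, 12, 17, 20, 17, 12, 7, 4]


def nine_piles(first_sort=FIRST_SORT):
    """Split the four outer piles in the ratio 2:3, leaving the middle pile whole."""
    low, below, middle, above, high = first_sort
    piles = []
    for pct in (low, below):      # below the mean: 2 parts worse, 3 parts better
        worse = pct * 2 // 5
        piles += [worse, pct - worse]
    piles.append(middle)
    for pct in (above, high):     # above the mean: mirror image of the lower piles
        best = pct * 2 // 5
        piles += [pct - best, best]
    return piles


piles = nine_piles()
differences = [p - t for p, t in zip(piles, TRUE_STANINES)]
```

Under these assumptions no pile departs from the theoretical percentage by more than one point, which is consistent with the claim that the simplified proportions are extremely close to the true ones.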


Using Checklists and Rating Scales 


As aids to the observers of processes, and the judges of products, checklists
or rating scales should be developed. A checklist merely provides a
systematic basis for recording observational data. A rating scale differs
from a checklist in that qualitative judgments are made and recorded.


USING CHECKLISTS A checklist is an aid to the observer in recording
information regarding sequence of acts or use of approved methods. The
person using the checklist might simply check the actions that occur or the
methods used. Or he might fill in numbers to indicate the sequence of
actions; the completed checklist then constitutes a step-by-step summary
of the student's procedures.

A checklist might be used to record those elements in a complex task
that had been satisfactorily completed. In scoring the dissection skills
tested in the performance test on page 404, the results of each student's
work are checked against a prepared list of the characteristics that the
dissection would show if it were properly done. Ideally, this checklist should
be developed by the students, or at least discussed with them, before it is
used to appraise their work.


5 Paul B. Diederich, “Simplified Measurement Techniques for Teachers,” The 15th 
Yearbook, National Council on Measurements Used in Education (New York: The 
Council, 1958), p. 25. 




SCORING FOR DISSECTION


a. Is the skin completely removed from the leg and foot? (1)
b. Are the muscles, tendons, and joints uninjured and intact? (1)


Score the remaining items for the gastrocnemius muscle and the tibialis anticus muscle
separately. Allow the specified number of points credit for each muscle.


                                                                            Gast.   Tib. Ant.
c. Is the muscle completely separated from adjacent muscles? (2)            ____    ____
d. Is the muscle attached at the origin? (1)                                ____    ____
e. Is the muscle fully dissected at the origin, its attachment distinct? (1) ____   ____
f. Is the muscle attached at the insertion? (1)                             ____    ____
g. Is the muscle fully dissected at the insertion,
   its attachment distinct? (1)                                             ____    ____
h. Is fascia of muscle smooth, not torn? (2)                                ____    ____
i. Are the fiber bundles entire, not frayed out? (1)                        ____    ____


Score on Dissection ____
(Sum of items a to i inclusive)


Students whose dissections are poor can be observed individually as they 


repeat their work. Errors in procedure can be checked against a teacher- 
made list. 
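The scoring of the dissection checklist can be sketched as a simple tally. The sketch assumes that each item is credited in full or not at all, and that item g carries one point by analogy with item e; the function name and data layout are hypothetical.

```python
# Point values from the scoring form; items c to i are scored once per muscle.
GENERAL_ITEMS = {"a": 1, "b": 1}
MUSCLE_ITEMS = {"c": 2, "d": 1, "e": 1, "f": 1, "g": 1, "h": 2, "i": 1}  # g assumed = 1
MUSCLES = ("gastrocnemius", "tibialis anticus")


def max_score():
    """Highest possible score on the dissection."""
    return sum(GENERAL_ITEMS.values()) + len(MUSCLES) * sum(MUSCLE_ITEMS.values())


def score_dissection(general_passed, muscle_passed):
    """Sum the credit for every item the observer checked as satisfactory.

    general_passed: set of item letters from a and b.
    muscle_passed: dict mapping a muscle name to the set of item letters passed.
    """
    total = sum(GENERAL_ITEMS[item] for item in general_passed)
    for muscle in MUSCLES:
        total += sum(MUSCLE_ITEMS[item] for item in muscle_passed.get(muscle, ()))
    return total


example = score_dissection(
    {"a", "b"},
    {"gastrocnemius": set(MUSCLE_ITEMS), "tibialis anticus": {"c", "d", "f"}},
)
```

Keeping the two muscles separate in the record preserves the diagnostic information that a single total would hide: in the example the student's difficulty lies entirely with the second muscle.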


USING RATING SCALES A rating scale, unlike the checklist, requires a
qualitative evaluation of aspects of a total performance or product, or of
steps or subtasks within a series. The first step in constructing a rating
scale is to break down the process or product into components. Decisions
may also be made concerning the relative importance of different components.

The rating scale in Figure 12.1 illustrates the rating of different steps of
a performance in process, that is, sawing to a line with a rip and cross-cut
saw. The rating scale in Figure 12.2 illustrates the rating of different
aspects of a process. Both these types of rating scales are more useful in
diagnosis than an over-all rating of the process or product as a whole. For
example, a student might receive a perfect score on all aspects of fastening
screws except items 3 and 7. If a total score on all items were computed,
it would be high; yet a student who hopelessly splits the wood into which
the screw is driven needs further teaching and practice.
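The point of the example can be made concrete with a small sketch that reports the item profile alongside the total. The item names follow the screw section of Figure 12.2; the individual ratings and the cutoff below which an item is flagged are assumptions chosen only for illustration.

```python
# Items of the screw-fastening section of the rating form (Figure 12.2).
SCREW_ITEMS = {
    1: "Slots",
    2: "Straightness",
    3: "Splitting",
    4: "Screw driver marks",
    5: "Countersinking",
    6: "Spacing",
    7: "Utility",
}


def summarize(ratings, weak_below=5):
    """Return the total score and the names of items rated below the cutoff."""
    total = sum(ratings.values())
    weak = [SCREW_ITEMS[i] for i in sorted(ratings) if ratings[i] < weak_below]
    return total, weak


# Hypothetical student: perfect on every item except 3 and 7.
ratings = {1: 10, 2: 10, 3: 2, 4: 10, 5: 10, 6: 10, 7: 3}
total, weak_items = summarize(ratings)
```

Here the total of 55 out of a possible 70 looks respectable, while the profile shows at once that splitting and holding power are the matters requiring reteaching.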

Rating scales also differ with respect to type of scale used. In Figures
12.1 and 12.2, a simple numerical scale is used. In Figure 12.4 each
numerical value is verbally defined in such a way as to encourage instructors

6 For a suggested checklist for scoring the dissection performance in process, see
Hawkes, Lindquist, and Mann, op. cit., pp. 254-255.




TO SAW TO A LINE WITH A RIP AND CROSS-CUT SAW


Tools and Materials: Sharp rip saw and cross-cut saw, bench, wood vise, and
piece of wood.


Directions: Observe pupil as he works, and rate him on the following points:


1. CLAMPING STOCK: 1 2 3 4 5 6 7 8 9 10
Stock should be so held that it will not be loosened or cracked, and
that its position will facilitate sawing.

2. STARTING CUT: 1 2 3 4 5 6 7 8 9 10
With thumb at line, saw should be placed against the thumb. Saw
should be pulled back slowly a few times to make a groove, then
pushed forward.

3. HOLDING SAW: 1 2 3 4 5 6 7 8 9 10
Saw should be held firmly. For cross-cut saw, angle should be 45
degrees; for rip saw, 60 degrees.

4. STROKE: 1 2 3 4 5 6 7 8 9 10
Stroke should be long and even, not too fast. Proper angle should be
kept during sawing. Line should be followed.

5. ENDING CUT: 1 2 3 4 5 6 7 8 9 10
The piece being cut off should be held with the free hand. Saw strokes
should be slow and with little pressure so as to prevent breaking off
the end.


Fig. 12.1. Form for Rating Different Steps in a Process. (A rough point-
scale for judging ability to saw to a line with a rip and cross-cut saw.)


Reprinted by permission of the publisher from M. M. Proffitt, and others, “The
Measurement of Understanding in Industrial Arts," The Measurement of Understanding,
45th Yearbook of the National Society for the Study of Education, Part I
(Chicago: National Society for the Study of Education, 1946), pp. 302-320.


to distribute scores more widely than such terms as “superior,” “good,”
and "fair"; in Figure 12.3 an attempt is made to define the extremes of
each scale in terms of observable characteristics.

Figure 12.3 illustrates good procedure in devising a rating form for
products. That is, good and poor products have been compared and those
characteristics that differentiate them have been identified and included in


(a) NAILS

(1) Straightness          1 2 3 4 5 6 7 8 9 10
Are nails driven straight, heads square with wood, no evidence of bending?

(2) Hammer marks          1 2 3 4 5 6 7 8 9 10
Is wood free of hammer marks around nails?

(3) Splitting             1 2 3 4 5 6 7 8 9 10
Is wood free of splits radiating from nail holes?

(4) Depth                 1 2 3 4 5 6 7 8 9 10
Are depths of nails uniform and of pleasing appearance?

(5) Spacing               1 2 3 4 5 6 7 8 9 10
Are nails spaced too close or too far apart?

(6) Utility               1 2 3 4 5 6 7 8 9 10
Will the nails hold?

(b) SCREWS

(1) Slots                 1 2 3 4 5 6 7 8 9 10
Are slots free of splitting and other evidence of driving strains?

(2) Straightness          1 2 3 4 5 6 7 8 9 10
Are screws straight, heads parallel with surface?

(3) Splitting             1 2 3 4 5 6 7 8 9 10
Is wood free of splits in the area of screws?

(4) Screw driver marks    1 2 3 4 5 6 7 8 9 10
Is wood free of screw driver marks near screws?

(5) Countersinking        1 2 3 4 5 6 7 8 9 10
Is countersinking neat and of satisfactory depth?

(6) Spacing               1 2 3 4 5 6 7 8 9 10
Are screws spaced too close or too far apart?

(7) Utility               1 2 3 4 5 6 7 8 9 10
Will the screws hold?


Fig. 12.2. Form for Rating Different Aspects of a Process. (Point-scale
rating form for “fastening” in Woodworking.)


From D. C. Adkins, and others, Construction and Analysis of Achievement Tests
(Washington, D.C.: Government Printing Office, 1948), p. 231.




the rating form. The same procedure is useful in devising a rating scale for 
judging performance in process. That is, teachers who are designing a rat- 
ing scale should compare good and poor performers (violinists, baseball 
pitchers, or performers in any other skill) as a basis for identifying those 
component skills in which they differ widely. The reader will recognize
that this procedure is similar in approach to the selection of test items on
which high-achieving and low-achieving students show the greatest difference
in performance. Many other suggestions for the improvement of rating
forms and rating procedures are given in Chapter 8. 


                        1                                2    3                              SCORE

Appearance of White     1. Dull                               Soft luster                    1. ____
                        2. Spread out and irregular           Thick with rounded outline     2. ____
                        3. Greasy                             No excess fat                  3. ____
Appearance of Yolk      4. Broken                             Whole                          4. ____
                        5. Not coated with white              Coated with white              5. ____
Consistency of White    6. Watery or very solid               Uniformly coagulated           6. ____
Tenderness of White     7. Leathery or crisp and hard         Tender                         7. ____
Taste and Flavor        8. Stale, flat, salty, or             Fresh, well seasoned           8. ____
                           unpleasant fat flavor

                                                                            Total Score      ____


Fig. 12.3. Rating Scale for Eggs (Fried)

Reprinted with the permission of the publisher from Clara Brown Arny, Minnesota
Food Score Cards (Princeton, N.J.: Educational Testing Service, 1946).


Using Product Scales 


A product scale is a graded series of products (usually five or more)
carefully chosen to represent successive levels of quality along an inferior-
superior continuum. In the evaluation of handwriting and composition
skills, product scales have been used for many years. In fact, the first
product scale in handwriting was developed by Thorndike in 1910.




In the development of a product scale, specimen products are selected 
(on the basis of ratings by experts) as representing different levels of qual- 
ity; these products are then used as a basis for grading students’ work. In 
order for the products to constitute an equal-interval scale, the difference 
in quality between specimens A and B should be approximately as great 
as between specimens B and C, and so on throughout the scale.* 

Once the scale of products has been developed and scores assigned, 
it can be used as the basis for evaluating student products. That is, each 
student's handwriting sample, essay, or other product is given the score 
of the specimen it most closely resembles in general quality. Product scales 
in handwriting have proved fairly satisfactory. The use of product scales in 


essay writing, however, has involved greater subjectivity and consequently 
lower reliability. 


ILLUSTRATIVE TECHNIQUES IN THE EVALUATION OF 
COMMUNICATION, MANIPULATIVE, AND ATHLETIC SKILLS 


Rating scales, product scales, and other evaluation techniques can be used
in evaluating products or processes in a wide variety of skills. A few examples
will be given to illustrate the wide variety of possible approaches.


Communication Skills 


not only gives an over-all evaluation 
of the student’s speech but also rates it on the b 
criteria: (1) adaptation to the communicatio 
speaker, and audience); (2) structure; 


(originality, freshness, accuracy, adequacy, 




rated. That is, student raters could underline statements that are especially 
applicable to the speaker. For use at the high school level, the scale should 
be simplified and probably reduced to a five-point rating scale. 

Product scales in essay writing have proved sufficiently reliable for 
evaluating the average level of student competency in a class or school, 
especially if teachers have been trained in their use. However, the problem 
of appraising individual proficiency in essay writing is not so easily solved. 
Students vary in their effectiveness from time to time and from topic to 
topic. After extensive research, the College Entrance Examination Board
discontinued the grading of essays9 and substituted objective and semi-objective
tests of related skills.

The staff of the Educational Testing Service has done extensive research
concerning ways in which essays can be more objectively graded.
In 1957, for the first time in several decades, a published test of essay 
writing of the product-scale type appeared as part of the STEP series. 
In order to obtain the specimen student essays for the eight essay topics at 
each level, five thousand student essays were examined and rated independ- 
ently by experts in the composition skills. 

The essay topics were selected to stimulate students to self-expressive, 
creative writing, rather than routine narration. The following essay topic is 
for Level 3 (grades 7-9): 


If you knew that you were to go blind twenty-four hours from now and that nothing 
could prevent it, what would you do with the time between now and this moment, this 
time tomorrow when you could see no more? Where would you go? What and whom 
would you try to see? Write an account of what you would do from the time you leave 


this room until your blindness strikes, explaining, if possible, the reasons for your
actions.10


In the week-by-week evaluation of themes, a more individualized appraisal
than is possible with product scales is desirable. The way in which
the teacher reads and evaluates students’ compositions can have lasting
effects on their attitude toward writing. The teacher should not limit his
notations to proofreading symbols. Comments, directed to the individual
writer and based on the teacher’s understanding of his needs and capacities,
can serve as a stimulus and guide to improvement.


8 Harry A. Greene, "English—Language, Grammar, and Composition,” Encyclopedia
of Educational Research (New York: The Macmillan Company, 1950), p.

9 Copies of each student's essays, however, are still supplied to the colleges to
which he makes application.

10 A Prospectus, Cooperative Sequential Tests of Educational Progress (Princeton,
N. J.: Educational Testing Service, 1957).




Directions: Indicate your rating on the five aspects of the speech by drawing a
circle around the number which represents your rating in each case.


Student ________ Subject ________


I. Adaptation to the Communication Situation


SUPERIOR EXCELLENT GOOD AVERAGE FAIR POOR VERY POOR
14 13 12 11 10 9 8 7 6 5 4 3 2 1


A. Suited to the assignment: follows assignment—stays within set time
limits

B. Suited to the speaker: ethical justification—well prepared—desire to
communicate

C. Suited to the audience: clear articulation—correct pronunciation—
visual aids where necessary—neatly dressed—poised in body and
facial expression—eye contact—conversational tone—variation in
pitch, rate, and loudness—subject appropriate

II. Structure of the Speech


SUPERIOR EXCELLENT GOOD AVERAGE FAIR POOR VERY POOR
14 13 12 11 10 9 8 7 6 5 4 3 2 1


A. Introduction: captures attention— focuses
to stated or implied purpose—establishe

B. Body: subject-anal

C.
liv subordinate to pur-
vision by one principle only

D. Transition elements: verbal bridges between divisions clear

E. Sentences: correct structure—varied structure—effective parallelism


III. Developmental Materials


SUPERIOR EXCELLENT GOOD AVERAGE FAIR POOR VERY POOR
14 13 12 11 10 9 8 7 6 5 4 3 2 1


personal experiences—avoids outdated
data—adapts old facts to new contexts

B

C. Accuracy of material: uses honest details—qualifies opinions—uses
specific support—avoids questionable authority

D. Adequacy of material: sufficient details—sufficient illustrative devices
—use of statistics or apt quotations when available

E. Relevancy of material: details pertinent—details realistic—connection
between examples and generalization demonstrated


IV. Skill in Expression 


SUPERIOR EXCELLENT GOOD AVERAGE FAIR POOR VERY POOR
14 13 12 11 10 9 8 7 6 5 4 3 2 1


A. Extemporaneous delivery: speaks without notes—minimum of vo- 
calized pauses—effective use of unvocalized pause—adaptation to 
audience reactions—rapport with audience 

B. Use of language: avoidance of clichés—sense of sentence rhythm— 
exactness in word choice—recognition of connotative value of words 

C. Use of voice: voice modulated to verbal symbols— pleasant tonal 
quality 

D. Use of body: projects alert body tone—purposive movement—co- 
ordinated movement— natural and spontaneous gestures 


V. Over-all Evaluation 


SUPERIOR EXCELLENT GOOD AVERAGE FAIR POOR VERY POOR
14 13 12 11 10 9 8 7 6 5 4 3 2 1


Name of student making the rating ________


Fig. 12.4. Scale for Evaluating Speaking


Reprinted by permission of The American Council on Education from Paul L.
Dressel and Lewis B. Mayhew, General Education: Explorations in Evaluation
(Washington, D.C.: American Council on Education, 1954), p. 81.


The following summary of “Principles of Theme Analysis,” although developed
for use at the college level, has many implications for the high-school
teacher of composition.


PRINCIPLES OF THEME ANALYSIS11


1. External motivation. Read a student's theme in the light of its external
motivation—the audience and assignment to which it is addressed—and its
success in meeting the requirements of that audience and assignment.

2. Internal motivation. Strive to understand the student's internal motivation
in writing the theme—what he is trying to do. In order to appreciate the
student’s purpose, one’s attitude in reading must be one of constructive
helpfulness rather than negative criticism.

3. Unrealized potentialities. Be alert to the unrealized potentialities of the
theme, the opportunities wasted or used without imagination. Calling these
to the attention of the student will give a valuable stimulus to future writing.


11 Adapted from Paul L. Dressel and Lewis B. Mayhew, "Objectives in Com- 
munication,” in General Education: Explorations in Evaluation (Washington, D.C.: 
American Council on Education, 1954), pp. 86-89.




4. Interdependence of parts. Recognize the interdependence of the parts of a 
theme. To view a part accurately is to see it in relationship to the whole. 
5. Concluding evaluative judgment. Relate your concluding judgment specifi- 
cally to the subject matter of the theme and see that it is consistent with the 
running commentary. When evaluating a "good" theme, avoid the temptation
of limiting your observations to minor flaws and concluding with a
congratulatory message. If the good student is to achieve his best, a full and
conscientious reading must call attention to both his successes and his failures.


Manipulative Skills 


Micheels and Karnes12 contend that, in determining final course marks
in industrial arts, teachers give far more weight to the quality of finished 
products than to any other factor, frequently giving little consideration 
to the design and planning of the project and to the procedures followed. 
The fallacy of this procedure is evident when one considers that a student 
might eventually complete a project of high quality and yet, in the process 
of its construction, have done one or more of the following: 


1. Consumed an unjustifiable amount of time in the completion of the
project.

2. Asked for and obtained more assistance from the instructor and from his
fellow-students than any other member of the group.

3. Wasted an undue amount of materials.

4. Performed inaccurate and faulty work which was concealed when the
project was assembled.

5. Abused tools and equipment; failed to use them properly.

6. Persistently violated safety rules.

7. Failed to follow the general procedure as initially planned.

8. Failed to accept the challenge to design a project of his own or even
select and adapt a design but waited for the instructor to assign him a design
to execute.

9. Showed no evidence of having developed an appreciation of good design
and skilled workmanship.

10. Failed to learn the related information about tools, materials, and processes
which was assigned as a part of his project.13


Figure 12.5 presents a com 
ing, planning, and executing 


12 William J. Micheels and M. Roy Karnes, Measuring Educational Achievement
(New York: McGraw-Hill Book Company, Inc., 1950), p. 398.

13 Ibid., p. 399.




are used conscientiously by the instructor, in conjunction with (1) study 
of the student's drawings, specifications, and plan of procedure, and (2) 
inspection of the finished project in every detail, using such measuring 
instruments as are necessary to determine the accuracy and quality of the 
student's work. 
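The scoring directions printed with the rating scale in Figure 12.5 (0 to 4 points for each item that applies, a total for each phase, a composite total, and a count of items that do not apply) can be sketched as follows. The function names, the use of `None` for an item struck out as inapplicable, and the sample ratings are all assumptions made for illustration.

```python
def phase_total(ratings):
    """Total the 0-4 ratings of one phase; None marks an item that does not apply.

    Returns (points earned, number of inapplicable items).
    """
    applicable = [r for r in ratings if r is not None]
    for r in applicable:
        if r not in range(5):
            raise ValueError(f"rating {r!r} is outside the 0-4 scale")
    return sum(applicable), len(ratings) - len(applicable)


def composite(phases):
    """Return per-phase totals, the composite total, and the count of inapplicable items."""
    totals, points, not_applicable = {}, 0, 0
    for name, ratings in phases.items():
        earned, skipped = phase_total(ratings)
        totals[name] = earned
        points += earned
        not_applicable += skipped
    return totals, points, not_applicable


# Hypothetical ratings for one student's project.
phases = {
    "designing": [4, 3, 3, None, 2],
    "planning": [3, 2, 4],
    "execution": [2, 3, 3, 4, 1],
    "completed project": [4, 4, 3, 3, None, 2, 3, 3],
}
totals, composite_total, n_not_applicable = composite(phases)
```

Because inapplicable items reduce the maximum attainable, composite totals from different projects are comparable only when the number of omitted items is reported alongside them, as the directions require.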

Manipulative skills are also involved in home economics. Product scales 
(with samples of different quality levels of hand sewing, French seams, 
bound buttonholes, and the like) can be used to advantage in teacher 
rating of products, in student self-rating, and in obtaining ratings by peers. 
In the development of such product scales, students' ratings of products 
can be utilized. Those samples on which student ratings show high agree- 
ment can be selected to represent approximately equal differences in 
quality. 

After products that represent several degrees of quality have been se- 
lected and scores assigned to them, each student's product can then be 
given the score of the specimen it most closely resembles. Independent 
ratings can be obtained by two or more judges (such as fellow-students) 
and their values averaged. The teacher should personally appraise any 
products for which the student's self-evaluation differs from the average 
rating assigned by classmates. 
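The averaging of independent ratings and the selection of products for the teacher's personal appraisal can be sketched briefly. The text does not say how large a disagreement should prompt review, so the tolerance of one scale point used below is an assumption, as are the names and the sample ratings.

```python
from statistics import mean


def flag_for_teacher_review(self_ratings, peer_ratings, tolerance=1.0):
    """List students whose self-rating differs from the peer average by more than tolerance.

    self_ratings: dict of student -> score the student gave his own product.
    peer_ratings: dict of student -> list of scores assigned by classmates.
    """
    flagged = []
    for student, self_score in self_ratings.items():
        peer_average = mean(peer_ratings[student])
        if abs(self_score - peer_average) > tolerance:
            flagged.append((student, self_score, peer_average))
    return flagged


# Hypothetical ratings on a five-step product scale.
self_ratings = {"Ann": 4, "Bea": 2, "Cal": 5}
peer_ratings = {"Ann": [4, 3, 4], "Bea": [4, 5, 4], "Cal": [5, 4, 5]}

to_review = flag_for_teacher_review(self_ratings, peer_ratings)
```

In this example only the second student's product would go to the teacher; she has rated her own work well below the level her classmates assigned to it.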

The homemaking teacher also finds that observation of performance in
process is necessary as an aid in diagnosis and as a basis for reteaching.
On the basis of a comparison of superior and inferior products in cooking
and sewing, she can develop checklists, teacher-rating scales, and self-rating
scales that will be of great value in improving students’ procedures.14


Athletic Skills 


In physical education both rating scales and more objectively scored
performance tests are used in evaluating the quality of student performance
in the component skills involved in gymnastics, track, such individual
sports as tennis and badminton, and team games. Whether rating scales or
more objectively scored performance tests are used, the first step is to
analyze the component skills involved in a sport. The following analysis
for basketball is illustrative:


1. Shooting 
a. Foul 
b. Shooting from the floor 
(1) One hand 


f one- and two-hand shots 
(2) Two hands


14 Clara Brown Arny, Evaluation in Home Economics (New York: Appleton-
Century-Crofts, 1953).




Name: Course: 
Project: Score: 
Instructor: Date: 


Numbers of items which do not apply: 


Directions: Each of the items in this scale is to be rated, if it applies, on the basis
of 4 points for outstanding quality, degree, compliance, or performance; 3 points
for better than average; 2 points for average; 1 point for inferior; and 0 for
unsatisfactory or failure. Encircle the appropriate number to indicate your rating.
Draw a horizontal line through the row of numbers opposite each item which
does not apply. Enter the total points earned under each major phase. Enter the
composite total in the space at the top of this sheet. Also indicate the number
of items which do not apply.


I. Designing Phase (Total Points ____)

1. To what extent is the project designed or selected
of value to him or to his associates?                        0 1 2 3 4
2. To what extent did he evidence sensitivity to
the elements of good design?
a. Size, proportion, balance, relative weight of
parts?                                                       0 1 2 3 4
b. Texture, color surface, and line enrichment?              0 1 2 3 4
3. Is the material selected appropriate? . . .               0 1 2 3 4
8. To what extent were his sketches and drawings
orderly and generally indicative of good workmanship?        0 1 2 3 4


II. Planning Stage (Total Points ____)

1. Did he obtain the basic information about tools,
materials, and processes essential to intelligent
planning?                                                    0 1 2 3 4
2. To what extent did he prepare his own plan of
procedure? . . .                                             0 1 2 3 4
6. To what extent did he take into consideration
the time, materials, equipment, and tools available?         0 1 2 3 4


III. Execution Stage (Total Points ____)

1. To what extent did he follow the detailed steps
of his plan?                                                 0 1 2 3 4
2. To what extent did he avoid having to do work
over because of failure to follow his plan?                  0 1 2 3 4
3. To what extent did he refrain from spoiling
materials by working accurately and carefully?               0 1 2 3 4
4. To what extent did he follow approved procedures
in performing specific operations? . . .                     0 1 2 3 4
13. To what extent was he able to do his own work
without assistance from instructor or other
students?                                                    0 1 2 3 4


IV. Completed Project (Total Points ____)

1. To what extent is finished product an embodiment of original plan?
2. Does the general appearance of the project reflect neat, orderly work?
3. Are the dimensions of the actual project the same as those on the drawing,
within reasonable tolerances?
4. How do angular measurements check with those specified?
5. Of what quality is the finish?
6. To what extent were materials used to best advantage?
7. Do all joints fit properly?
8. Are all margins uniform? Are curved and irregular lines properly executed,
etc.?


Fig. 12.5. A Teacher Rating Scale for the Designing, Planning, and Executing
Stages of Woodwork Projects

Reprinted by permission of the publisher from William J. Micheels and M. Roy
Karnes, Measuring Educational Achievement (New York: McGraw-Hill Book Company,
Inc., 1950), pp. 408-410.




2. Ball handling
a. Passing (subdivided into different kinds of passes)
b. Receiving
c. Dribbling (and combinations)
3. Total body skills
a. Jumping
b. Speed
c. Pivot
d. Endurance¹⁵


A number of available tests of sports skills, suitable for use in secondary
schools, are listed in Clarke¹⁶ and in Adams and Torgerson.¹⁷ Since the


15 Leonard A. Larson and Rachel D. Yocom, Measurement and Evaluation in
Physical, Health and Recreation Education (St. Louis: The C. V. Mosby
Company, 1951), pp. 208-209.

16 H. Harrison Clarke, Application of Measurement to Health and Physical
Education, 3d ed. (Englewood Cliffs, N. J.: Prentice-Hall, Inc., 1959).

17 Georgia Sachs Adams and T. L. Torgerson, Measurement and Evaluation for
the Secondary School Teacher (New York: Holt, Rinehart and Winston, Inc., 1956),
pp. 466-483.




better skills tests require considerable time to administer, they tend to be 
used only with activities in which instruction has been given for several 
weeks. Rating scales are more frequently used for activities receiving less 
emphasis and for all those skills (such as dancing or swimming) in which
grace, balance, and form are emphasized.


VALIDITY AND RELIABILITY OF EVALUATIONS OF 
STUDENT PERFORMANCE IN THE SKILLS 


Validity 


Since we have been concerned in this chapter with the direct measure- 
ment of actual student behavior and actual products made, it may seem 
that no problems of validity are involved and that we may concentrate 
on reliability of measurement. This would be the situation if our sampling 
of each student's behavior were highly representative of his work as a 
whole, and our scoring of that sample were unaffected by extraneous factors.
When products can be scored objectively, such as the speed and number of
errors in typewriting, we can take validity for granted. However, few of the
evaluation situations described in this chapter are entirely free
from the bias of raters or from extraneous factors that obscure our evalua- 
tion of the criterion behavior in which we are really interested. 
f our ratings of performance in process 
we can replay a record of a student's 
age or view a film (in slow motion) of his 
T Observation and scoring is 
ngs affected by our general 
d his evidences of interest in 
qualified judge. 


If we can listen to recordings, view films of 


student is eliminated. If we want to 


know whether
on the average, in their handwritin. 


: 2 spring of the 
year. This would eliminate the effect of wi pud 
validity of our inferences concerning h 

We also increase the validity of scor 




extraneous factors have minimal effect on student performance. For ex- 
ample, we should have students use the same kinds of equipment (in 
equally good repair) and the same kinds of materials (lumber, dress 
material, and the like) if we are to make valid comparisons with respect to 
their relative skill. 

When subjective judgments must be made by observers, validity can be 
increased by (1) selecting competent observers who know the crucial 
elements of a good performance or product and by (2) training them to 
observe those aspects of a performance or product that most clearly
differentiate students who are rated high or low on much larger samplings
of performances or products. 


Reliability 


Evaluation in the skills presents difficult problems in achieving an ade- 
quate level of reliability for interindividual comparisons. A review of 
Chapter 4 reveals that the major factors affecting reliability of scores are 
(1) length of test or size of sample, (2) objectivity in scoring, (3)
consistency in test administration, and (4) appropriateness of the level of
difficulty for students. Of these, the last two factors are ordinarily easier
to cope with than the first two. We have already considered in this chapter
the desirability of the third factor, that is, having clear, standard directions. 

As far as the fourth factor, difficulty of the test, is concerned, students 
can be most fairly compared if they are all given the same test. In many 
skills, such as typing, running, and the like, the proficiency of all students
can be adequately measured in a test situation suitable for all. In other
situations, we would have a more efficient and reliable test of skill if
students were grouped and the difficulty of the test adjusted for lower-ability
and higher-ability students. However, unless we had some way of obtaining
comparable converted scores (for example, by giving all tests to a sampling
of the student population), we would have sacrificed comparability of
scores.

It makes good sense, for example, to test our most able instrumental music
students on the most difficult selections. Such a test does not waste time in
having them play selections that are too easy for them, and it provides a more
efficient and adequate basis for differentiating among the more able students.
The teacher may choose these gains and give up the advantages of uniform test
content. Or the teacher may utilize the concept of a uniform “anchor test” for
all students plus (a) more difficult musical selections for the most proficient
and (b) selections of less-than-average difficulty for the least proficient.
The more accurately the teacher can peak the test to the student’s level of
competency, the more adequately the student can demonstrate his skill in a
limited amount of testing time, and the more reliable his test results are
likely to be on a test-retest basis.¹⁸

The first factor, size of sample, is a genuine problem in those tests of 
skill where a scorable unit requires considerable student time; that is, so 
much time is required for a student to write one essay, bake one cake, or 
make one blueprint that one cannot include nearly as many items as on a 
paper-and-pencil test. Sometimes one can select crucial elements of a task 
(as in the writing tasks on page 389 of Chapter 11). Or, as was suggested 
on page 403, one could increase the number of tasks administrable in a 
given testing time by having them partially completed. The answer to the 
problem of limited sampling varies from area to area. Certainly, one pos- 
sibility is to cumulate products, or stanine scores on work samples, over a 
period of time. Then letter grades could be assigned on the basis of such 
cumulated data. 
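
The cumulating of stanine scores suggested above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: it uses the
conventional stanine bands (4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of the
group), converts each raw score by its midpoint percentile rank within the
class, and then averages each student's stanines over several work samples. The
function names `stanines` and `cumulate` are introduced here for the example
and are not part of the text.

```python
from bisect import bisect_right

# Cumulative percentage cut points separating stanines 1 through 9
# (bands of 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent).
STANINE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanines(scores):
    """Return one stanine (1-9) per score, using midpoint percentile ranks."""
    n = len(scores)
    result = []
    for s in scores:
        below = sum(1 for x in scores if x < s)
        tied = sum(1 for x in scores if x == s)
        pct = 100.0 * (below + 0.5 * tied) / n  # midpoint percentile rank
        result.append(bisect_right(STANINE_CUTS, pct) + 1)
    return result


def cumulate(stanine_lists):
    """Average each student's stanines across several work samples.

    stanine_lists holds one list per work sample, with students in the
    same order in every list.
    """
    return [sum(vals) / len(vals) for vals in zip(*stanine_lists)]
```

Because every work sample is first converted to the same nine-point scale, a
sample that happened to be scored on a 50-point basis counts no more heavily in
the cumulated average than one scored on a 10-point basis.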

The problem of improving objectivity of scoring is involved in the evalu- 
ation of most skills. There are intrinsic differences from skill to skill in 
the extent to which one must depend on subjective judgment. For exam- 
ple, in the area of physical education, speed of running or height in the 
high jump or rope climb can be scored with as high reliability as we obtain 
in objective tests. Students’ relative skill in body control, as shown in the 
head stand, or their steadiness in the hand stand can be judged with only 
moderate reliability, because we must depend to some degree on the sub- 
jective judgment of observers. Evaluation of a player's form in tennis or 
swimming is affected even more by subjectivity in judgment.


referred to the discussion of the SRA Achievement Series in Chapter 13 
19 Paul B. Diederich, "Measurement of Skill in Writing," School Review, vol. 54
(December 1946), pp. 584-592.




different aspects of a product; for example, handwriting can be rated for 
beauty and legibility. 

Reliability can be increased by averaging judgments made by two, three, 
or more persons who make their judgments independently (without knowl- 
edge of ratings assigned by others). With older students, who understand 
the importance of learning to evaluate products, the teacher can use a 
systematic plan for having each student's products (with no name attached)
evaluated by several of his classmates.²⁰
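
How much is gained by averaging independent judgments can be estimated with the
Spearman-Brown formula, which is the same relationship that links reliability
to test length. The sketch below is an illustration only: it assumes the judges
are interchangeable and rate independently, and the function names and the
figures in the usage note are hypothetical rather than taken from any study
cited here.

```python
import math


def pooled_reliability(r_single, k):
    """Spearman-Brown estimate for the average of k comparable raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


def raters_needed(r_single, target):
    """Smallest whole number of raters whose pooled reliability reaches target."""
    if not 0 < r_single < 1 or not 0 < target < 1:
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    k = target * (1 - r_single) / (r_single * (1 - target))
    # The small tolerance guards against floating-point error when k
    # works out to a whole number.
    return max(1, math.ceil(k - 1e-9))
```

For example, if a single judge's ratings had a reliability of .50, the average
of three such judges would be estimated at .75, and four judges would be needed
to reach .80. The estimate is optimistic to the extent that the judges share
the same biases.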

Another source of error variance is instability in the performance itself,
or student inconsistency on different testing occasions. Actually, it is only
when we get reasonable interscorer agreement and reasonably standard
conditions of work that we have any chance to study this source of error
variance, that is, the variation in student performance from one test sample
to another. Traxler and Anderson²¹ had students write a pair of two-hour
essays a few days apart on highly similar topics, “The Discovery of Gold
in California” and “The Pony Express.” They distributed a set of instructions
for writing the essay, an outline specifying the four main divisions
of the paper, and a set of unorganized notes as background information.
Each student was required to base his paper entirely on the notes provided.
Competent, trained readers showed high interscorer agreement on
grades. Yet, despite the care the researchers had taken to make the tasks
comparable and to minimize invalid sources of variance, students’ scores
on the two sets of essays correlated only .60. It seems that students do
vary more in their essay writing performance than in their performance
on tests of arithmetic or spelling.
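
A coefficient of this kind is a product-moment correlation between the scores
the same students earn on two occasions. The following sketch shows the
computation; it is an illustration only, and any scores passed to the function
in practice would come from the teacher's own records.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two paired lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no spread")
    return sxy / math.sqrt(sxx * syy)
```

A teacher who has students write on two comparable topics can pair each
student's first and second scores and pass the two lists to `pearson_r`. A low
coefficient obtained despite good interscorer agreement points to instability
in the students' performance rather than to careless scoring.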

Reliability, as well as validity, can be increased by selecting raters who
are especially competent in judging certain aspects of a performance or
product. French made a factor analysis of readers' ratings of essays. He
identified a large group of readers who emphasized “Mechanics and Wording.”
Interscorer reliability for their ratings of student essays was .70.
However, the essay scores assigned by these raters correlated so highly
with students’ scores on objective tests of writing skills that the Board
decided to use the objective tests only. They made a further study of
students’ factor scores on “Ideas,” an aspect of writing ability that can be
measured only through actual essay writing. Such scores were almost
completely unrelated to the “Mechanics and Wording” scores.


20 Simpson developed an ingenious plan for the rating of products by classmates
so that each student would have to rate only five products and yet each student's
product would be compared with a random sampling of 20 others. The plan is
presented in Ray H. Simpson, “Patterns for Rating Learning Products," Educational
and Psychological Measurement, vol. 13 (Winter 1953), pp. 614-617.

21 A. E. Traxler and H. A. Anderson, “The Reliability of An Essay Test in
English,” School Review, vol. 43 (September 1935), pp. 534-539.




Although the “Ideas” score seemed potentially valuable, the reliability 
coefficient was extremely low, only .31. However, when they selected 
readers who assigned considerable importance to “Ideas,” the interscorer 
reliability coefficient increased to .46. Although this reliability is still too 
low for individual scores to be dependable, the increase from .31 to .46
seems to be attributable to the fact that some persons are more competent 
than others in judging this important and elusive aspect of essay writing. 

Since those aspects of essay writing that could be scored with fairly 
high reliability can be measured even more consistently by objective and 
semiobjective tests, French recommends that further research be conducted 
to improve the effectiveness with which we can measure those aspects of 
essay writing for which objective tests provide no substitute. For example, 
if the topics were selected and the directions worded so that students 
would concentrate on certain aspects of essay writing (for example,
organization and ideas), reliability of these scores might be further increased.
According to French, 


If we psychometricians can encourage testing and further clarification of 
those aspects of writing that objective tests cannot measure, encourage the use 
of readers who favor grading those particular qualities that are desirable to 
grade, and see to it that the students are aware of what they are being graded
on, we can enlighten rather than merely disparage the polemic art of essay
testing.²²


This statement illustrates how objective testing and the subjective judgment
of competent judges can supplement each other effectively in evaluating
student progress toward major educational goals.


SUMMARY STATEMENT 


Evaluation of student performance in the skills outcomes of many subject
fields is often neglected because of the inherent difficulties involved. Such
tests of "performance in process" tend to be of the work-sample or
simulated-situation types. The tasks selected for such tests should be tasks
crucial to the attainment of major objectives and ones which are fairly
difficult, not too time consuming, and capable of being administered under
fairly standard conditions and evaluated with considerable objectivity.


Evaluation of the products of performance tends to be more reliable than 
evaluation of performance in process since more time is available for the 


22 John W. French, "Schools of Thought in Judging Excellence of English
Themes,” Proceedings of the 1961 Invitational Conference on Testing Problems
(Princeton, N. J.: Educational Testing Service, 1962), p. 28.




judging process; independent judgments can be made and compared; product 
scales can be developed; and teachers can be trained in their use. If products 
are ranked, and/or stanine scores are assigned to products, differences in rater 
generosity do not affect students' scores. 
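
The point about rater generosity can be checked with a small sketch. It assumes,
for illustration only, that a generous rater differs from a strict one by a
constant number of points while ordering the products in the same way; the
function name `ranks` and the ratings in the usage note are hypothetical.

```python
def ranks(ratings):
    """Rank products from 1 (highest rating) down; ties share the average rank."""
    order = sorted(range(len(ratings)), key=lambda i: -ratings[i])
    result = [0.0] * len(ratings)
    pos = 0
    while pos < len(order):
        # Extend the run over every product tied with the one at pos.
        end = pos
        while end + 1 < len(order) and ratings[order[end + 1]] == ratings[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result
```

If a strict rater assigns 2, 3, 1, and 4 points to four products and a generous
rater assigns 3, 4, 2, and 5 to the same products, `ranks` returns the same
list for both. Ranking removes differences in level between raters; it does not
remove disagreement about which product is better.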

Checklists can be used to summarize information on the methods used by 
students, or on the sequences of steps employed. Rating scales require a quali- 
tative evaluation of the performance or product. Ratings obtained on different 
steps in a process, or on different aspects of a process or product, are more
useful in diagnosis than over-all quality ratings. 

A number of suggestions were made for increasing the validity and reliability 
of judgments concerning student performance in the skills. The problems in- 
volved, and hence the best techniques of evaluation, vary considerably from 
one subject area to another. In all areas, however, we are concerned that the 
sampling of skills be as large and representative as feasible; that aids be 
provided to increase the fairness and consistency of grading; that student 
performance be observed or student products obtained under comparable con- 
ditions; and that the tasks assigned be of appropriate difficulty. 


SELECTED REFERENCES 


ADKINS, DOROTHY C., "Principles Underlying Observational Techniques of
Evaluation,” Educational and Psychological Measurement, vol. 11 (Spring
1951), pp. 29-51.

AMERICAN ASSOCIATION FOR HEALTH, PHYSICAL EDUCATION AND RECREATION, 
Youth Fitness Test Manual. Washington, D.C.: The Association, 1958. 

ANDERSON, C. C., "The New Step Essay Test as a Measure of Composition
Ability," Educational and Psychological Measurement, vol. 20 (Spring 
1960), pp. 95-102. 

ARNY, CLARA B., Evaluation in Home Economics. New York: Appleton-Century- 
Crofts, 1953, Chapter 7. 

BEAN, KENNETH L., Construction of Educational and Personnel Tests. New 
York: McGraw-Hill Book Company, Inc., 1953, Chapter 6. 

BAKAN, EDWARD E., “How Do V-Ag Graduates Perform?", Agricultural Education
Magazine, vol. 29 (May 1957), pp. 259, 261-262.

FRENCH, ESTHER L., AND EVELYN STALTER, "Study of Skill Tests in Badminton
for College Women," Research Quarterly of the American Association for
Health, Physical Education, and Recreation, vol. 20 (October 1949), pp.
257-272.

GREENE, EDWARD B., Measurements of Human Behavior, rev. ed. New York:
The Odyssey Press, Inc., 1952, Chapter 9.

HENDRICKS, B. CLIFFORD, "Laboratory Performance Tests in Chemistry," Journal
of Chemical Education, vol. 27 (June 1950), pp. 309-311.

KORAN, SIDNEY W., “Performance Testing in Public Personnel Selection,"
Educational and Psychological Measurement, vol. 1 (July, October 1941),
pp. 233-252, 365-386.

MCPHERSON, MARION W., “A Method of Objectively Measuring Shop Perform- 
ance," Journal of Applied Psychology, vol. 29 (February 1945), pp. 22— 
26. 

MICHEELS, WILLIAM J., AND M. ROY KARNES, Measuring Educational Achieve- 
ment. New York: McGraw-Hill Book Company, Inc., 1950. 




MILLER, FRANCES A., "A Badminton Wall Volley Test," Research Quarterly of 
the American Association for Health, Physical Education, and Recreation, 
vol. 22 (May 1951), pp. 208-213.

MILLER, RICHARD T., "A New System of Tennis Stroke Analysis," Athletic 
Journal, vol. 32 (March 1952), pp. 45-46, 75-77. 

PEAK, HELEN, "Problems of Objective Observation," in Leon Festinger and 
Daniel Katz, eds., Research Methods in the Behavioral Sciences. New 
York: Holt, Rinehart and Winston, Inc., 1953, pp. 243-299. 

ROTHROCK, THURSTON M., "Checking the Student's Knowledge with the Camera,"
Industrial Arts and Vocational Education, vol. 38 (January 1949),
pp. 19-22.

RYANS, DAVID G., AND NORMAN FREDERICKSEN, "Performance Tests of Educational
Achievement," in E. F. Lindquist, ed., Educational Measurement.
Washington, D.C.: American Council on Education, 1951, Chapter 12.

SIMPSON, RAY H., "Patterns of Rating Learning Products," Educational and 
Psychological Measurement, vol. 13 (Winter 1953), pp. 614—617. 

SIRO, EINAR E., "Performance Tests and Objective Observation," Industrial Arts
and Vocational Education, vol. 32 (April 1943), pp. 162-165. 

THORNDIKE, ROBERT L., Personnel Selection; Test and Measurement Techniques.
New York: John Wiley and Sons, Inc., 1949, Chapters 1-3.

WALL, CLIFFORD NATHAN, H. KRUGLAK, AND L. E. H. TRAINOR, "Laboratory
Performance Tests at the University of Minnesota," American Journal of
Physics, vol. 19 (December 1951), pp. 546-555.

WATKINS, JOHN C., "Objective Measurement of Instrumental Performance,"
Teachers College Record, vol. 44 (February 1943), pp. 376-377.


WRIGHTSTONE, J. WAYNE, Measuring the Effectiveness of Instruction in
Vocational Education. Albany, N.Y.: University of the State of New York,
February 23, 1951.

WRIGHTSTONE, J. WAYNE, “Observational Techniques,” Encyclopedia of Educational
Research, 3d ed., C. W. Harris, ed. New York: The Macmillan Company, 1960,
pp. 927-933.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Discuss the values and limitations of tests of “performance in process” in
homemaking, industrial arts, or some other subject in which the development
of skills is emphasized.

2. Discuss the values and limitations of product scales in the same subject
field. 


3. Outline your plans for developing a performance test of the work sample 
or simulated-situation type for use in evaluating skills in your major subject 
field. Follow the guidelines that are presented in this chapter for selecting 
tasks for performance tests. 

4. Develop a checklist for recording the sequence of acts or the use of ap- 
proved methods in the performance of some manipulative skill. 

5. Discuss the values and limitations of rating scales in evaluating student
skills in a specific sports activity, for example, tennis or diving. 


6. Prepare a guide to be used in appraising students’ laboratory skills, as 
shown in a specific laboratory exercise. 




7. Obtain several specimens of handwriting, and evaluate them on the basis 
of a handwriting scale of the survey type. 

8. Assume that you are head of an English department; prepare a bulletin 
of suggestions for teachers of composition on the evaluation of students'
themes. 

9. Modify the speech rating scale (Fig. 12.4) so as to make it more suitable 
for use on the high school level. 


13

The Place of Standardized Achievement Tests in the Improvement of Instruction


Before we consider the current place of standardized achievement testing
in education, we will trace briefly the history of such testing in the
American schools. Such a review will help us to understand some of the
contributions made by this type of testing, as well as some of the reasons
why educators are divided as to whether standardized tests constitute
aids to achieving educational goals or whether they are a barrier to
significant progress.

Actually, most educators have come
can be either helpful or h
they are selected and used.


с among which test users may make a 
choice, has held up well under criticism, 


HISTORY OF ACHIEVEMENT TESTING 


When one considers the widespread use of standardized tests today, it is 
difficult to realize the youth of objective testing. In fact, it was only about 
a century ago that school enrollments became sufficiently large that uniform 
written examinations were first adopted in the schools of Boston as a
substitute for the characteristic oral examinations used to pass on the
qualifications of students.


1 Robert C. Hall, "Types of Tests Available," School Life, vol. 42 (September 
1959), pp. 10-13. 


The Beginnings of Standardized Achievement Testing 


Just as group intelligence tests were developed in order to meet a prac- 
tical problem, so the need for group achievement tests arose from a prac- 
tical school situation. As a school administrator, Rice was faced in 1894 
with considerable pressure to bring into the curriculum such new, prac-
tical subjects as manual training and home economics and also with con- 
siderable opposition on the part of educators who thought that there was 
hardly sufficient time to teach the subjects already in the curriculum. As 
an initial step toward studying scientifically the effectiveness of instruc- 
tion under different time allotments, Rice decided to administer uniform 
tests in spelling in a number of schools. 

The next achievement tests developed were the Stone Reasoning Test in 
Arithmetic, published in 1908, and the Thorndike Handwriting Scale,
published in 1909. Because of the number and significance of his contributions
in the following decade, E. L. Thorndike is generally considered to be 
the father of the educational-measurement movement. 

Beginning in 1910, a number of studies were made on the unreliability 
of teachers’ grading of students. The findings stimulated the development of 
more objective procedures for testing the achievement of students and for
assigning marks or grades. In one of the most striking of the early studies,²
copies of the same geometry paper were marked by 116 teachers of high school
mathematics; the grades assigned varied from 28 to 92. Evidence from studies
of English composition and other subjects revealed similar inconsistencies.
College teachers were found to be shockingly inconsistent when they regraded
papers of their own students without knowledge of the marks they had formerly
assigned.³

Such findings gave tremendous impetus to the development and use of 
achievement tests that utilized the objective type of questions, developed 
for use in group intelligence tests during World War I. The need had 
been established, and the techniques had been introduced. As the decade 
of the 1920s opened, the stage was set for large-scale development of 
group tests of achievement.

An important development of the early 1920s was the organization of 
tests into batteries. In 1922 the first edition of the Stanford Achievement 
Test appeared. In revised forms, this has continued to be one of the leading
achievement-test batteries. By administration of a test battery (which
includes subtests on the skills of reading, arithmetic, language, and other
subjects), measures could be obtained of children’s comparative achievement
in these different areas; and the achievement of class and school groups could
be interpreted in comparison with age or grade norms (the average achievement
for children of the same age or grade level).

2 Daniel Starch and Edward C. Elliott, “Reliability of Grading Work in
Mathematics,” School Review, vol. 21 (April 1913), pp. 254-259.

3 Daniel Starch, “Reliability and Distribution of Grades,” Science, vol. 38
(October 1913), pp. 630-636.


Use of the New Tests in School Surveys 


Administrators and supervisors soon accepted the new testing tech- 
niques as valuable tools by which to compare the achievement of classes 
and to rate the efficiency of teachers. City, county, and state surveys of 
school systems flourished during the 1920s. In these surveys, group tests 
of intelligence and achievement were used extensively in an effort to 
determine the efficiency of instruction and to study other administrative, 
supervisory, and curricular problems. The average scores obtained by
classes, grades, and schools were compared with national norms for the 
tests. Objectivity and efficiency in education were emphasized. 

This period was also characterized by increased interest in the range of 
individual differences revealed by testing programs, and in educational 
research, especially of the type related to determining the efficiency of 
school services. On the debit side, however, there must be recognized a
tendency toward overconfidence in tests and frequent misuse of test results
as a basis for judging the quality of teaching. Moreover, the growing market
stimulated overproduction of tests, many of which were inferior in
quality and in standardization.


Reactions to Criticisms of Standardized Tests 


The 1930s and 1940s were characterized by a general acceptance of the
usefulness of tests in the schools and the development of a more critical
attitude regarding the values of specific tests. Leaders of the progressive
movement in education emphasized the fact that tests were concerned
almost exclusively with elementary skills and with facts to be memorized.
Critics of standardized tests reminded teachers that these tests did not
measure progress toward the ultimate goals of education: understandings,
attitudes, and appreciations. It was pointed out that a composite of the
results of currently used tests for a given individual did not provide a
true or complete picture of the individual as a whole.

Such criticisms led to attempts to obtain more comprehensive appraisals
through the use of anecdotal records, interviews, and case histories, as well
as the development of tests concerned with higher-order cognitive learnings,
which go beyond information and skills. Use of these measures
tended to provide a better evaluation of the child in terms of his
many-sided patterns of development.


Summary of Historical Trends in Achievement Testing


Although space does not permit us to include many of the interesting
milestones in the development of achievement testing, the following
generalizations will provide background for the study of current uses of
achievement tests.

1. The testing movement helped to arouse the profession to the extent and
significance of individual differences in student achievement and readiness
for new learnings.

2. The inadequacy of early tests and the misuse of test results led to certain
undesirable outcomes: (a) standardized tests were frequently used in
schools without consideration of their appropriateness in the local
educational program; (b) results of survey testing were frequently misused as a
basis for judging teaching efficiency; (c) as a result, the teaching in many
schools became largely the coaching of students for test passing; and (d)
since tests failed to evaluate student growth on a sufficiently broad basis,
teachers’ emphasis on tested educational outcomes led to an undesirable
narrowing of the educational program. What was most easily measured
became most important. It is not surprising that many teachers resented
the use of standardized achievement tests and that antagonism toward
testing developed on the part of many educators who were striving to
broaden and vitalize the educational program.

3. Curriculum change and a growing emphasis on child study led to needed
modifications in measurement and evaluation. Many standardized tests,
which had measured information only, were broadened so as to measure
student understandings and application of principles.

4. There has been increased awareness among administrators of the fact that
standardized tests measure student growth toward only a limited number
of educational goals. Hence greater caution is being exercised in making
inferences from test data concerning the general teaching effectiveness of
faculty members. Administrators have also learned to take into account in
their interpretations of test data for classes and schools the many factors
that affect school achievement, such as the scholastic aptitude and cultural
background of students.

5. As it became apparent that published tests could provide only a partial
answer to the problems of appraising student growth, teachers were given
preservice and inservice education in the development of their own tests
and in the use of other techniques of assessing student growth toward
educational goals. The first book designed to help teachers in improving
their own examinations was written in 1924.⁴ In the 1930s the work of the
Eight-Year Study⁵ in secondary schools provided a tremendous stimulus to
the development of test exercises that went far beyond the testing of
knowledge and included most of the major objectives listed in the taxonomy,

4 G. M. Ruch, The Improvement of the Written Examination (Chicago: Scott,
Foresman and Company, 1924).

5 Eugene R. Smith, Ralph W. Tyler, and others, Appraising and Recording Student
Progress (New York: Harper & Row, Publishers, Inc., 1942).


[Diagnostic profile chart: the student's score on each subtest of Parts I, II,
and III is plotted against a percentile-rank scale running from 1 to 99, banded
as low, average, and high.]

*For Test 6, the scores above the lines are for 9th grade general science. The
scores below the lines are for 10th grade biology.

Fig. 13.1. Sample Profile for the California Tests in Social and Related
Sciences, Advanced Battery, Administered to an Eleventh-grade Student in May

Reproduced with the permission of The California Test Bureau (Monterey, Calif.:
California Test Bureau, 1954).


The Place of Standardized Achievement Tests 437 
Aid in Student Decisions 


Problems 6 and 7 in Chapter 1 had to do with the advisement of stu- 
dents regarding the choice of college-preparatory subjects and of vocations. 
Achievement tests of the generalized-outcome type would be an aid to 
counselors in these situations. Such tests correlate highly with scholastic 
aptitude and are much more easily discussed with students and parents 
than are scholastic aptitude tests. 


LEADING ACHIEVEMENT TESTS 


With the exception of end-of-course examinations in high school subject 
fields, schools generally prefer to administer achievement test batteries 
rather than separate subject tests in reading, arithmetic, language and other 
areas. The chief reason for their preference is that test batteries provide 
comparable scores for students in all subtests, based on the same 
norming samples.

Selection of an achievement test battery for a school district is a very 
important decision. It is usually desirable to use such a battery at several 
grade levels and over a period of years to obtain comparable data for use 
in evaluating student progress. The emphases in the battery can, over a 
period of years, influence students and teachers in their distribution of 
study time. 

In making their own judgments concerning the value of a test for local 
use, members of a test selection committee should consult the reviews of 
published tests in Buros’ Mental Measurements Yearbooks and in pro- 
fessional journals, taking advantage of the judgment of experts to help 
narrow down the number of test batteries to be considered. For each test 
battery under consideration, committee members should examine the test 
itself (preferably by taking the test). Such an examination should reveal 
the extent to which the content is keyed to the objectives of the local course 
of study and should identify factors that might invalidate results for local 
groups (for example, disproportionate emphasis on specific facts, test so
easy that it would fail to measure the best achievers adequately, and the 
like). The committee should also summarize the data in the manual 
relevant to topics considered in the summary form in Chapter 5. 


Achievement Test Batteries for Both Elementary and Secondary Schools 


Of the most widely used achievement test batteries, only two span the 
elementary and secondary school years with articulated tests measuring 
essentially the same pattern of learning outcomes throughout all grades. 
Those two batteries are the California Achievement Tests (CAT) and the 


438 THE IMPROVEMENT OF INSTRUCTION 


Sequential Tests of Educational Progress (STEP). These batteries have 
both been included in dual standardization programs (that is, with a com- 
panion test of scholastic aptitude, the California Test of Mental Maturity 
and the School and College Ability Tests respectively, being administered 
to the same students). The grade range for the CAT tests is grades 1 
through 14; for the STEP tests, grades 4 through 14. 

There are major differences between the two batteries. The CAT tests
limit their coverage to the fundamental skills of reading, mathematics, and
language,¹¹ while the STEP series extends the communication skills to
include listening and writing skills and also includes tests in social studies
and science. The STEP series is the only achievement test battery now
available that includes (1) a test of listening ability and (2) an essay
test, in which students write themes on selected topics, which are to be
evaluated by semiobjective, clearly defined standards. Both test batteries
are so designed that the user can choose to purchase and administer tests
in one or more of the major subject areas.


The STEP tests are designed to measure "critical skills in application
of learning." Devised with the advice and assistance of educators recom-
mended by national professional groups (in English, mathematics, and
other subject areas), these tests place minimum emphasis on memory and
greater emphasis on the higher levels of cognitive abilities. The following
summary of skills sampled by the science and social studies tests will help
the reader to understand this emphasis on generalized outcomes, rather
than upon knowledge of specifics:


Skills sampled in STEP science tests
The ability to

1. Identify and define a scientific problem.

2. Suggest, screen, and test a hypothesis.

3. Design experiments and to collect data.

4. Interpret data and draw conclusions.

5. Evaluate critically the printed and spoken word.

6. Reason quantitatively and symbolically.


Skills sampled in STEP social studies tests
Ability to

1. Read and interpret maps, charts, cartoons, pictures, diagrams, as well as
the printed word.

2. Think critically, to distinguish fact from opinion, and to recognize
propaganda.

3. Assess and interpret data.

4. Apply appropriate outside information and criteria.

5. Draw valid generalizations and conclusions.¹²


11 The California Test Bureau publishes separate tests in study skills, social
studies, and science, listed in the Appendix under the titles California Study
Methods Survey (grades 7-13) and California Tests of Social and Related
Sciences, Elementary (grades 4-8) and Advanced (grades 9-12).


This type of test design, in which students are presented with unfamiliar 
problems, has a few disadvantages. Fewer items of this type can be admin- 
istered within an hour of testing time. Hence, with the exception of the 
essay test, each of the tests (in reading, language, and the like) requires 
70 minutes of testing time. Another characteristic of this approach is that 
the tests become highly verbal tests; that is, a large percentage of the 
variance in scores on all tests of the series is attributable to differences in 
the students’ verbal ability. At the fourth-grade level, scores on the STEP 
mathematics test correlated more highly with the verbal scores of SCAT 
(School and College Ability Test) than with the quantitative scores.¹³

The STEP series is outstanding in its carefully designed test items, its
use of percentile bands¹⁴ to emphasize errors in measurement, and its aids
to the teacher and students in the interpretation of test results. The CAT
series also has many aids to teacher use of test results. The Scoreze book-
lets for the CAT tests combine the advantages of machine scoring with
carbon copies of answer sheets, which clearly indicate the items each
student has missed and the types of learnings tested by them. Diagnostic
analyses have also been prepared for each test. Although these analyses
have been criticized for giving users a misleading impression of the diag-
nostic value of the test, they can provide diagnostic leads. However, only a
few items of each type are included; hence teachers must obtain further
evidence concerning the validity of any hypotheses growing out of their
study of these diagnostic clues. The use of this type of aid is considered
further in Chapter 14.

Both tests provide national norms. The CAT provides grade placement
norms throughout the full grade range, although their use at the high
school level is of limited value. Percentile norms for each grade are also
provided. The STEP tests use only percentile norms. Although grade
placement norms have certain limitations, discussed in Chapter 2, they
also have certain advantages for the elementary school grades (in the
measurement of gains, and in the comparison of class and school results
with national averages).


12 Cooperative Sequential Tests of Educational Progress (Princeton, N. J.:
Educational Testing Service, 1957).
13 Anne Anastasi, Psychological Testing (New York: The Macmillan Company, 
1961), p. 448. 
14 See illustrative profile in Chapter 5, page 164. 




The provision of anticipated achievement norms constitutes a unique 
feature of the CAT tests. Use of AAGP norms, discussed in Chapter 14, 
makes it possible to compare the scores of each student with those for 
students in the norming sample of comparable age, grade, and scholastic 
aptitude and to know the approximate standard error for this type of 
comparison. 

It is apparent that each of these two series incorporates many features
designed to make it valuable for use in schools. Although a few differences
have been pointed out between the two series, the choice between them,
or among these and several other batteries available, should be made chiefly
on the basis of (1) the examination of test content in terms of its validity
for the local educational program, (2) a careful study of their manuals,
and (3) the reviews in the Buros Yearbooks.


Achievement Test Batteries for the Elementary and Junior High 
School Grades 


It is easy to understand why several achievement test batteries do not 
include the upper secondary school years. In many school districts, city-wide 
achievement testing programs are conducted only in grades 1 through 8. 
In the higher grades, students begin to take differentiated programs; hence, 
the all-school testing program often becomes limited to tests designed to 
aid students in their choices of curricula and in post-high-school planning. 

In addition to the CAT and STEP tests, four widely used achievement 
test batteries are available for the elementary school grades. 


IOWA TESTS OF BASIC SKILLS (GRADES 3-9) Includes tests of vocabulary,
reading comprehension, language skills (spelling, capitalization, punctuation,
usage), and work-study skills.

This test includes a test on work-study skills for even the youngest children.
Its emphasis is on functional skills, rather than specific information. More
subscores in language are provided than in some other tests. The battery is
unusual in that tests for all areas and grade levels are included in one spiral-
bound reusable test booklet. Tests for each grade are adapted specifically to
that grade but utilize some of the test items for adjacent grades.


METROPOLITAN ACHIEVEMENT TESTS 


Primary I (for second half of grade 1) includes tests of word knowledge (sight
vocabulary), word discrimination (selecting orally presented word from
set of printed words), reading comprehension, arithmetic concepts and
skills.

Primary II (for grade 2) includes these tests plus spelling.

Elementary (grades 3-4) adds language and two separate tests on arithmetic
(computation, and problem solving and concepts).

Intermediate (grades 5-6) has a partial battery, including the tests in the
Elementary battery (with the exception of word discrimination). The complete
battery also includes science and two tests in social studies (information
and study skills).

Advanced (grades 7-9) has partial and complete batteries with the same
subtests listed above.

The range of content of the Metropolitan batteries for grades 5 and above
is almost as comprehensive as that for the STEP tests, except that the latter
includes skills in listening and writing.


SRA ACHIEVEMENT SERIES 


Four batteries (grades 1-2, 2-4, 4-6, and 6-9), measuring skills and under-
standing in four general areas: (1) reading, (2) arithmetic, (3) language, and
(4) work-study skills. All four batteries include under reading subtests on
comprehension and vocabulary; in addition, the battery for grades 1-2 includes
a verbal-pictorial association subtest, which measures the ability to compre-
hend isolated words, phrases, and sentences and a language perception subtest
(including auditory discrimination, visual discrimination, and sight vocabulary).
Language arts tests for the highest three levels provide subtest scores on
capitalization and punctuation, grammatical usage and spelling. Arithmetic
tests in all batteries provide subtest scores in reasoning, concepts and usage,
and computation. The work-study skills tests at the two higher levels provide
subtest scores in references and charts.


This test series differs from the other batteries in at least two respects. 


1. Instead of assuming high-level motivation on the part of all students, the
authors have presented the items in the three lower batteries in story form
in order to elicit greater pupil interest. In the preliminary try-outs, pupil
reactions to story interest were obtained and revisions made in terms of
these reactions. In order to keep the tests in arithmetic and language from
being influenced unduly by reading comprehension ability, special effort was
made to keep the reading difficulty of these tests well below the reading
level of the grade being tested.

The story approach also represents an attempt to make the test situation
less artificial and measurement of criterion behavior less indirect. For
example, vocabulary items are presented in context, with the pupil selecting
the meaning appropriate to the context. The arithmetic test items are
grouped [...] related situations in such a [...] performance can be readily
interpreted in terms of [...] correctly, illogical reasoning, inappropriate
arithmetic procedure, or errors [...] resulted in greater test length, just
as [...] STEP series required longer tests to achieve adequate reliability.

2. [...] designed for pupils who are achieving [...] superior achievement
for that grade [...]. The slow learners will be identified but not measured
by the test. The authors recommend that academically retarded pupils be
given the battery designed [...]. [...] difficulty facing all authors and
publishers of achievement test batteries. [...] of this problem is quoted
from the Technical Supplement for that test.




[Fig. 13.2 Effects of Various Test Gradients. Axes: Grade Level; Number of
Test Items.]

Reproduced with the permission of Science Research Associates from Louis P. Thorpe,
D. Welty Lefever, and Robert A. Naslund, Technical Supplement, SRA Achievement Series
(Chicago: Science Research Associates, 1957), p. 5.


Three possible policies for the construction of the tests and the sampling
[...] of these procedures has advantages as well as characteristic weaknesses:


1. The test could be made to cover a wide range of abilities in a given
curricular area, but with relatively few items at each level of difficulty [Figure
13.2, line A]. It would contain items appropriate for severely retarded, as well
as for gifted, pupils. At least two serious limitations are likely to characterize
such a test. First, the small number of items at each difficulty level will not
yield dependable measures of pupil achievement. In short, there would be too
few items having just the right range of difficulty for each pupil. Second, a
reduced sample of test items would also result in inadequate curricular cov-
erage at each grade level.

2. The test could be designed to contain a rich sampling of items at all
curriculum levels, appropriate both to the poorest achiever and the most skilled
pupil. While an excellent measure of achievement at all levels of ability would
be assured, the total testing time would rise beyond feasible limits. In the fifth
grade, for example, in order to develop an adequate reading test in terms of
this policy, it would be necessary to include a considerable quantity of reading
material with questions suitable in difficulty for children all the way from the
second grade to the tenth grade [Figure 13.2, line B]. The other levels of ability
involved would require similar sampling.

3. A third policy, the one adopted for the Series, is to provide an adequate 
sample of items for each difficulty level covered, but at the same time to reduce 
the total range of difficulty by "lopping off" the lower end of the scale [Figure 
13.2, line C]. This design is unusual for achievement tests and thus requires
a brief explanation.

Each battery has been so constructed that it does not contain easy items
suitable for the seriously retarded pupil to answer correctly, and only a rela-
tively few items simple enough for the low-average learner to handle success-
fully. However, the upper level of each test battery has been extended suffi-
ciently so that it overlaps the next higher battery to a considerable extent.

For example, the SRA Achievement Series, 2-4 has a range from grade 2.5
to grade 4.9. Since this battery might be administered to a fourth grade class
at the close of the school year, it contains test items difficult enough to
measure the most able fourth grade pupil at that time. This means that this
battery includes a considerable sampling of fifth and sixth grade test items.¹⁵
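
The trade-off among the three policies quoted above can be made concrete with a little arithmetic. The sketch below is illustrative only: the 90-item test length, the 30 items per grade level taken as "adequate," and the three-grade span for the third policy are assumptions chosen for the example, not figures from the Technical Supplement.

```python
def items_per_level(total_items: int, n_levels: int) -> float:
    """Items available at each grade level when a fixed-length test
    is spread evenly over n_levels grade levels."""
    return total_items / n_levels


def total_items_needed(per_level: int, n_levels: int) -> int:
    """Test length required to give per_level items to every grade level."""
    return per_level * n_levels


# Hypothetical figures, chosen only for illustration.
TEST_LENGTH = 90          # items that fit in the available testing time
ADEQUATE_PER_LEVEL = 30   # items per level assumed to give dependable scores
WIDE_RANGE = 9            # grades 2 through 10, as in the fifth-grade example
NARROW_RANGE = 3          # reduced range with the lower end "lopped off"

# Policy 1: wide range, fixed length, hence few items per level.
policy_1 = items_per_level(TEST_LENGTH, WIDE_RANGE)
# Policy 2: wide range with rich sampling, hence a very long test.
policy_2 = total_items_needed(ADEQUATE_PER_LEVEL, WIDE_RANGE)
# Policy 3: rich sampling over a reduced range.
policy_3 = total_items_needed(ADEQUATE_PER_LEVEL, NARROW_RANGE)

print(f"Policy 1: {policy_1:.0f} items per grade level in a {TEST_LENGTH}-item test")
print(f"Policy 2: {policy_2} items needed in all")
print(f"Policy 3: {policy_3} items needed in all")
```

Under these assumed figures, only the third policy keeps both the test length and the number of items per level within workable limits, at the cost of not measuring the slowest learners.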


Both Hopkins¹⁶ and Sax¹⁷ have found that it is possible for students mark-
ing items at random to obtain chance scores almost at grade level on leading
achievement batteries. Hence, it is highly desirable that either tests be
lengthened (alternative 2) or that tests disclaim the ability to measure
over such a wide range (alternative 3) and test users assume responsibility
for administering less difficult tests to slow learning pupils.
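
The size of a chance score can be estimated with a short calculation. The sketch below assumes a hypothetical subtest of 60 four-choice items scored as number right, and treats random marking as a binomial process; these figures are illustrative and are not taken from any of the batteries discussed here.

```python
from math import comb, sqrt


def chance_score_stats(n_items, n_choices):
    """Mean and standard deviation of the number-right score when every
    item is marked at random (binomial model)."""
    p = 1.0 / n_choices
    return n_items * p, sqrt(n_items * p * (1.0 - p))


def prob_at_least(n_items, n_choices, cutoff):
    """Probability that pure guessing yields at least `cutoff` items right."""
    p = 1.0 / n_choices
    return sum(
        comb(n_items, k) * p**k * (1.0 - p) ** (n_items - k)
        for k in range(cutoff, n_items + 1)
    )


# Hypothetical subtest: 60 items, 4 answer choices each.
N_ITEMS, N_CHOICES = 60, 4
mean, sd = chance_score_stats(N_ITEMS, N_CHOICES)
print(f"Expected chance score: {mean:.1f} (SD {sd:.2f})")
print(f"P(score >= 20 by guessing): {prob_at_least(N_ITEMS, N_CHOICES, 20):.3f}")
```

If the norms table for such a test placed a raw score near the expected chance score at or close to grade level, a pupil could appear to be achieving at grade level without reading a single item, which is the situation Hopkins and Sax describe.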


STANFORD ACHIEVEMENT TEST 

Primary (grades 1-3) includes two tests of reading ability (word meaning
and paragraph meaning); two tests of arithmetic (arithmetic computation
and arithmetic reasoning), and a spelling test.

15 Louis P. Thorpe, D. Welty Lefever, and Robert A. Naslund, Technical Sup-
plement, SRA Achievement Series (Chicago: Science Research Associates, Inc.,
1957), pp. 4-5.
16 Kenneth D. Hopkins, "Validity Concomitants of Various Scoring Procedures
Which Attenuate the Effects of Response Sets and Chance," unpublished doctoral
thesis, University of Southern California, [...].
17 Gilbert Sax, "Theoretically Derived Chance Scores and Their Normative
Equivalents on a Selected Number of Standardized Tests," Educational and Psycho-
logical Measurement, vol. 22 (Autumn 1962), pp. 573-576.




Elementary (grades 3-4) includes the tests listed above and adds a three-part
language test (capitalization and punctuation, sentence sense, and usage).

Intermediate (grades 5-6) includes the tests listed above in its partial battery.
The complete intermediate battery also includes additional tests on
social studies and science.

Advanced (grades 7-9) has the same organization as the intermediate battery.


The student will recognize that this test covers essentially the same areas 
as the Metropolitan Achievement Tests, published by the same company. 
The Metropolitan, however, has tests of two levels of difficulty for the 
primary grades and includes three subtests of reading. The content of the 
Stanford Achievement Tests, in the social studies and science areas, has 
been criticized in the Buros Yearbooks and elsewhere as emphasizing 
unrelated factual questions, rather than the understanding of significant 
concepts and principles. These criticisms, however, do not apply to the 
1964 edition, which is based on a content analysis of more recently pub- 
lished textbooks. 


Achievement Test Batteries for the High School Level 


In addition to the CAT and STEP series discussed earlier, there are 
three other achievement test batteries designed for use at the high school 
level. All three are designed to help predict student success in college and 
to help students recognize areas of weakness in their preparation for 
college. However, they differ markedly in their length, cost, and the types 
of educational outcomes emphasized. 

The Iowa Tests of Educational Development (ITED) were developed
to measure the student's progress in the development of broad intellectual 
skills. These tests emphasize understanding of what the student has learned 
and his ability to apply his learnings, rather than his recall of specific facts. 
They resemble the STEP tests in their basic approach. The STEP tests 
differ from the ITED in that they represent an articulated series from 
grades 4 through 14 and in that they include tests of listening and 
essay writing. 

The following comparison of tests in the two series may be useful: 


STEP, LEVEL 2        ITED
GRADES 10-12         GRADES 9-12

Mathematics          Test 4  Ability to do quantitative thinking

Science              Test 2  General background in the natural sciences
                     Test 6  Ability to interpret reading materials in
                             natural sciences
                     Test 9  Use of sources of information

Social Studies       Test 1  Understanding of basic social concepts
                     Test 5  Ability to interpret reading materials in
                             social studies
                     Test 9  Use of sources of information

Reading              Test 7  Ability to interpret literary materials
                     (See also tests 5 and 6 above)
                     Test 8  General vocabulary

Writing              Test 3  Correctness and appropriateness of expression

Essay                (none)

Listening            (none)


It is evident from this comparison that the ITED has more test material
than STEP in the areas of social studies and science. In each of these
subject areas, one can note from ITED scores whether a student seems to
have a deficiency in background knowledge, ability to interpret text ma-
terials in the field, or in ability to locate and use reference materials (as
reflected in test 9). On the other hand, STEP offers more scores in the
important area of communication skills.

Expectancy tables have been developed so that one can predict from a
student's scores at any grade level (in high school) his probable score
on the tests of the College Entrance Examination Board, and his probable
academic success in three types of colleges. Profiles are available that show
the average ITED scores earned in high school by students majoring in
eleven different college areas. Each student receives a profile leaflet entitled
"Your Scores on the ITED and What They Mean," while counselors receive
copies of a guide on "How To Use the Test Results."

The total score predicts college grades with unusually high predictive
validity, with some validity coefficients approaching .60. This level of
validity is attributable in part to the length of the test, eight hours of stu-
dent testing being required unless shorter testing times (optional with the
user) are employed.

The Essential High School Content Battery is a shorter test, which can
be administered in three and a half hours. Although norms are available
for students in commercial, science, general, and academic curricula, the
test is best suited to college-preparatory students. The four subtests
(mathematics, science, social studies, and English) cover typical content
of required courses. Greater emphasis is placed on factual knowledge than
in the ITED. However, items are well constructed; and the science section
includes items on application of principles.

For a high school that wants to measure student achievement of im-
mediate objectives in these academic areas and use the total score as a
basis for predicting college success, this test provides an economical sub-
stitute for the longer STEP or ITED batteries.

Still less time (only two hours) is required for administration of the
Cooperative General Achievement Tests for grades 9-13. This battery
includes tests in three subject areas (social studies, natural sciences, and
mathematics). The subtests tend to emphasize generalized outcomes more
than does the Essential High School Content Battery. Each test has
two parts, Part I emphasizing an understanding of terms and concepts and
Part II, the ability to interpret materials and problems relevant to the
field. This battery is essentially designed to measure proficiency in three
areas important to future achievement in college. English is not included,
probably because of the widespread use of the Cooperative English Tests
(in reading comprehension and English expression). If the English tests
are also administered, the total time is comparable to that for the Essential
High School Content Battery.


End-of-Course Achievement Tests for the High School Level 


The student will find many tests in high school subjects listed in the 
Appendix. Only through an examination of such tests and their manuals 
can he appraise their validity for his own purposes. It is desirable, how- 
ever, to mention two leading series of course-oriented achievement tests. 
In each of these series, a common score scale is used in all tests of the series. 

The Cooperative Test Service of the American Council on Education 
pioneered in the development of end-of-course examinations in high school 
subjects. Subject-matter specialists in each high school subject worked 
with specialists in test construction to develop a large number of tests that 
would be acceptable to many teachers. In 1948 this test service became the 
Cooperative Test Division of the Educational Testing Service. 

The various revisions of the Cooperative Mathematics Tests have in- 
cluded a comprehensive 80-minute test for grades 7 through 9 and 
one-period tests in (1) elementary algebra (through quadratics), (2) inter- 
mediate algebra (quadratics and beyond), (3) plane trigonometry, (4) 
plane geometry, and (5) solid geometry. In 1962, new tests in arithmetic, 
algebra, and geometry were published. These tests were designed to reflect 
some of the newer emphases in mathematics, but important aspects of 
traditional mathematics are also measured. Additional tests in third-year 
algebra, trigonometry, analytic geometry, and calculus are to be pub- 
lished in 1964. 

The Cooperative Science Tests include an 80-minute test for grades 7
through 9 and also one-period tests in (1) general science, (2) biology,
(3) chemistry, and (4) physics. A special series of unit tests has also
been prepared for those schools using the physics course developed by
the Physical Science Study Committee. The Cooperative Test Division is
engaged in a major revision of its course-oriented tests in science, which
are scheduled for publication in 1964.

The Cooperative Social Studies Tests include an 80-minute test for 
grades 7 through 9 and also one-period tests in American history, Ameri- 
can government, ancient history, modern European history, and world 
history. In the revised series, scheduled for publication in 1965, both 
junior high school and senior high school tests in American history will 
be available, as well as revised tests in American government, world history, 
and modern European history. New tests in civics and problems of 
democracy are also being developed. 

The Cooperative Foreign Language Tests now include one-period tests
in French (elementary and advanced), Spanish (elementary and ad-
vanced), and Latin (elementary and advanced). A listening compre-
hension test in French (with tape-recorded selections) is also available.
In 1964 the Cooperative Test Division will publish a series of new tests,
which are being developed in cooperation with the Modern Language
Association and the United States Office of Education. This comprehensive
testing program will cover

1. Five languages (French, German, Spanish, Italian, and Russian).

2. Four skills in each language (reading, writing, listening, and speaking).

3. Two levels in each skill (beginning and intermediate).

4. Two equivalent forms for each level.


Multiplication of the numbers listed above indicates that this program
involves 80 new tests. This series includes the first standardized tests ever
developed of ability to speak foreign languages. Obviously this program
represents a tremendous advance in measurement in this subject area, an
advance that could not have been made without financial subsidy and the
cooperation of many professional workers in both the subject fields and
the field of professional test construction.
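
The count of 80 follows from multiplying the four factors: 5 languages by 4 skills by 2 levels by 2 forms. The short sketch below simply enumerates the combinations to confirm the product; the form labels "A" and "B" are placeholders assumed for the example, since the text does not name the forms.

```python
from itertools import product

languages = ["French", "German", "Spanish", "Italian", "Russian"]
skills = ["reading", "writing", "listening", "speaking"]
levels = ["beginning", "intermediate"]
forms = ["A", "B"]  # placeholder labels for the two equivalent forms

# One test for every combination of language, skill, level, and form.
tests = list(product(languages, skills, levels, forms))

print(len(tests))   # total number of tests in the program
print(tests[0])     # one example combination
```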

Fortunately for test users, another series of high school achievement
tests is also available. We use the term "fortunately" because of our con-
viction that the test user should be able to choose among available achieve-
ment tests in terms of their content validity for his purposes. In fact, the
teacher will find listed in the Appendix other published tests that may
serve his needs better than tests from either of these series.

The Evaluation and Adjustment Series, published by Harcourt, Brace & 
World, Inc., is the other leading series of high school achievement tests. 
This series includes more than 20 subject tests, which are listed in 
Appendix A in the sections on language, mathematics, reading, science, 
social studies, study skills, and miscellaneous.




This series offers some tests (such as those in civics, health knowledge, 
and psychology) that have no parallel in the Cooperative series. On the 
other hand, the Cooperative series includes tests in foreign language and 
in a few other subjects (for example, solid geometry, ancient history, and 
modern European history) that have no parallel tests in this series. 

Two features of the Evaluation and Adjustment Series should be em-
phasized. One is the availability of item norms, which are of great value
in group diagnosis (as illustrated in Tables 13.1 and 13.2). An additional
advantage is that individual teachers can have students omit certain ques-
tions, or they can combine selected questions from two forms, as suggested
in Table 13.3. By obtaining the average value of item norms for questions
used, a teacher can still compare his class average with average student
achievement on these items by the norming sample.
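
The comparison described above can be sketched in a few lines. The item numbers, percentage-right norms, and student responses below are invented for illustration; in practice the item norms would be read from the test manual.

```python
def average_item_norm(item_norms, items_used):
    """Average percentage right earned by the norming sample
    on the questions the teacher chose to use."""
    return sum(item_norms[i] for i in items_used) / len(items_used)


def class_percent_right(responses, items_used):
    """Average percentage right earned by the class on the same questions.
    `responses` is a list with one dict per student: item number -> True/False."""
    n_right = sum(1 for student in responses for i in items_used if student[i])
    return 100.0 * n_right / (len(responses) * len(items_used))


# Hypothetical item norms (percentage of the norming sample answering correctly).
item_norms = {1: 80, 2: 60, 3: 40, 4: 70}

# The teacher omits question 3 and uses the rest.
items_used = [1, 2, 4]

# Hypothetical responses for a class of three students.
responses = [
    {1: True, 2: True, 3: False, 4: True},
    {1: True, 2: False, 3: False, 4: True},
    {1: True, 2: True, 3: True, 4: False},
]

norm_average = average_item_norm(item_norms, items_used)
class_average = class_percent_right(responses, items_used)
print(f"Norming sample: {norm_average:.1f} percent right")
print(f"This class:     {class_average:.1f} percent right")
```

Because both averages are computed over the same set of questions, the comparison remains meaningful even when the teacher's selection differs from the published form.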


Table 13.3
Hypothetical Example of the Use of a Standardized Test as a "Final
Examination"* for Teachers' Classes That Have Given Somewhat
Different Emphases to Various Aspects of the Subject


1. Each teacher of American history in the school district independently reviewed 
all items on the two forms of the Crary American History Test. On accompanying 
answer sheets, he rated each item according to the following scale: 


A—Essential concept or item of information; should be learned by every student 
B—Of major importance 

C—Fairly significant; usually covered 

D—Comparatively unimportant, or inappropriate for this grade level 
E—Inconsequential, trivial 


2. Of the 180 items on the two forms, 95 were given either an A or B rating by 
10 of the district's 12 teachers of American history. These items constituted an
"anchor test." 

3. Each teacher selected an additional 30-40 items that he wanted to have included 
in his final examination. The remaining 45-55 questions (which differed from 
teacher to teacher) could be deleted from answer sheets so that students would 
not have to spend time on them. However, if desired, all questions could be 
administered to students with the understanding that any question not considered 
in their text or class discussions would be deleted from their final examination.ᵇ

4. When the tests had been administered, all tests for the school district were scored 
on the anchor test key. Then the tests for each teacher were scored for the 
additional questions he wished to have included on his end-of-course examination. 

5. Local stanine norms for the anchor test were established by graphing the fre- 
quency distribution on the Otis Normal Percentile Chart. 

6. An "expected median raw score" for each teacher's "final examination" was ob-
tained by averaging the "percentage right" values for all test questions included
in his examination. These item values were obtained from the test manual. Such
an expected median score, of course, made no allowance for differences between
local classes and the norming population, with respect to average intelligence or
reading achievement. Such allowance, however, could readily be made on the
basis of data provided under 7a below.

7. Each teacher received a report for each of his classes, which gave the following 
information: 


a. The distribution of stanine scores for each class on a reading comprehension 
test that had been given early in the school year. This information provided 
a crude basis for judging how well the class might reasonably be expected to 
achieve in a course that required considerable reading. 

b. The distribution of stanine scoresᶜ for each class on the anchor test (com-
posed of questions to which local teachers had assigned A or B ratings).

c. An item count (of the type shown on page 479) for all the questions in his
"final examination." These data for all items could easily be compared with
data in the test manual for the norm sample.ᵈ

d. A list of students' names, with two scores for each student: 


(1) a stanine score on the "anchor test"
(2) a raw score on the teacher's "final examination"


ᵃ The term "final examination" is put in quotation marks because it is assumed that each
teacher would supplement the standardized test by questions he (or he and his school col-
leagues) had devised, which measured outcomes that did not seem to be adequately measured
by the standardized test. The teacher could combine the scores from the standardized test
(or portion thereof) with the scores from his own teacher-made test, weighting the scores in
a proportion which seemed best to him.

ᵇ Under this plan, each student had to use two answer sheets, which represented an addi-
tional four cents per student for testing supplies.

ᶜ The distributions of stanine scores on the reading and history tests for the school district
were not needed as a basis for comparison. By comparing each class distribution with the
Standard Percentages (see Chapter 2, page 41), the teacher could see how each class com-
pared with the school district in achievement on the anchor test and on ability to read
assignments. In other words, comparisons with school district figures could easily be made by
the teacher but were not forced upon him.

ᵈ Local curriculum research can be facilitated by comparing school-district performance on
each item with comparable data for the norm sample. For an illustrative study utilizing this
approach, see Tables 13.1 and 13.2 in this textbook and also "Curricular and Instructional
Implications of Test Results," Test Service Bulletin No. 75 (New York: Harcourt, Brace & World,
Inc., n.d.).
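
Steps 2 and 6 of Table 13.3 can be sketched as follows. This is only an illustration under stated assumptions: the ratings and percentage-right values are invented, "by 10 of the district's 12 teachers" is read as "at least 10," and the averaged percentages are converted to raw-score units by multiplying by the number of items (equivalently, summing the proportions). The function names are hypothetical.

```python
# Hypothetical ratings (A-E) given to three items by each of 12 teachers.
ratings = {
    1: list("AAABBBAABBCA"),  # 11 teachers rate the item A or B
    2: list("ABCCDCBACCDE"),  # 4 teachers rate the item A or B
    3: list("BBAABBABABCD"),  # 10 teachers rate the item A or B
}

# Hypothetical "percentage right" values of the kind a test manual would supply.
manual_pct_right = {1: 80, 2: 55, 3: 60}


def select_anchor_items(ratings, min_teachers=10):
    """Return the items rated A or B by at least `min_teachers` teachers (step 2)."""
    return sorted(
        item
        for item, marks in ratings.items()
        if sum(mark in ("A", "B") for mark in marks) >= min_teachers
    )


def expected_raw_score(percent_right_values):
    """Average percent right times number of items, i.e. the sum of proportions (step 6)."""
    return sum(percent_right_values) / 100


anchor = select_anchor_items(ratings)
expected = expected_raw_score([manual_pct_right[i] for i in anchor])
print(anchor, expected)
```

As the table itself cautions, a figure obtained this way makes no allowance for differences between local classes and the norming population.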


The second unusual feature is explained in Chapter 2. By equating the
average standard score on a physics test to the average IQ of physics
students, and making similar adjustments for other high school subjects,
the publishers have developed a set of standard scores that is comparable
from subject to subject. These norms make intraindividual comparisons
possible and facilitate the comparison of the achievement of individuals
or groups with their expected achievement, as shown in Chapter 14.

The first innovation is of great value to high school teachers who want 
to take advantage of the item-writing skills of specialists and yet avoid 
tailoring their instruction to any specific test. The second innovation is of 
value to teachers in detecting students whose general scholastic aptitude 
should enable them to achieve at a higher level in a subject. It is also of 
special value to counselors who can obtain a better picture of a student's 
relative achievement in different subject fields than can be provided by 
comparing his grades in these subjects. 


INTERPRETATION OF DATA FROM ACHIEVEMENT 
TESTING PROGRAMS 


In comparing achievement test results for different pupils, classes, or
schools, allowance must be made for differences in scholastic aptitude. 
The use of Anticipated Achievement Grade Placements has already been 
discussed as an approach that takes into account not only the student's 
scholastic aptitude but his age and grade. Although AAGP's are available 
only for the CAT series, other publishers have prepared devices similar 
to that presented in Table 13.4 to assist test users in modifying their 
expectations for students or groups whose IQ’s are considerably above or 
below average. Other characteristics of the student population, such as the 
level of vocabulary and language usage in neighborhoods of low socio- 
economic status, or the bilingual background of children, must also be 
considered in making interschool comparisons. 

Most school districts administer a selected achievement test battery at 
several different grade levels as a basis for ascertaining whether individual 
students and groups have made at least a year's gain during a calendar 
year. Differences among classes in average scholastic aptitude must also 
be considered in interpreting gains. 

When we attempt to make allowance for scholastic aptitude, however, 
another disconcerting difficulty, known as the regression effect, complicates
the problem. A review of the diagram in Figure 3.2 will remind the reader 
that scores on the predicted variable tend to regress toward the mean of 
the group. To use less technical language, the regression effect means that 
students who rank high on a scholastic aptitude test are not likely to rank 
quite as high on a test of achievement; while those who rank low on the 
aptitude test are not likely to rank quite as low on the achievement test. 

Table 13.4
METROPOLITAN ACHIEVEMENT TESTS (Grades 3-4): Expected Deviations from
Grade Norm for Pupils at Successive Levels of Pintner-Durost IQ (Scale 2)

[The body of this table is illegible in this copy. Its rows are intelligence
quotient intervals from 75-79 through 125-129; its columns are Word Knowledge,
Word Discrimination, Reading, Spelling, Language, Arithmetic Computation, and
Arithmetic Problem Solving and Concepts; its last row gives the correlation
between IQ and each subtest.]

NOTE: In interpreting individual performances, considerable leeway should be
allowed, since both the pupil's IQ and his Metropolitan grade equivalent are
subject to some measurement error. In general, the higher the correlation [in
the last row], the more reliable the tabled deviation values.

Source: Metropolitan Achievement Tests: Expected Deviations from Grade Norm for
Pupils at Successive Levels of Pintner IQ, Test Service Notebook No. 18 (New
York: Harcourt, Brace & World, Inc., 1961).

Durost and Prescott have illustrated this regression effect in their dis-
cussion of expected stanine scores on achievement tests for students who
are above average or below average in intelligence.¹⁸ For example, if the
correlation between an intelligence and achievement test were .70, the
most probable achievement test scores for individuals at each intelligence
or capacity stanine would be as follows:


STANINE ON                        MOST PROBABLE
TEST OF INTELLIGENCE              STANINE SCORE
OR SCHOLASTIC APTITUDE            ON ACHIEVEMENT TEST

        9                               7.8
        8                               7.1
        7                               6.4
        6                               5.7
        5                               5.0
        4                               4.3
        3                               3.6
        2                               2.9
        1                               2.2


When we examine this table, we can see that a student with an intelligence
stanine of 9 and an achievement stanine of 8 cannot be labeled an "under-
achiever."¹⁹ It should not be difficult for the student to see that the same
principle is involved when a superintendent compares the achievement 
data for schools. It is difficult for a high-ability school to show as high an 
average score in achievement as in scholastic aptitude. 
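
The stanine values tabled above can be checked with a short Python sketch. It applies the standard-score regression equation to deviations from the mean stanine of 5, with r = .70 as in the example; the function name is ours, chosen only for illustration.

```python
def predicted_stanine(aptitude_stanine, r, mean_stanine=5.0):
    """Most probable achievement stanine: mean + r * (deviation of aptitude stanine from mean)."""
    return mean_stanine + r * (aptitude_stanine - mean_stanine)


# Reproduce the table for a correlation of .70, from stanine 9 down to stanine 1.
table = [(s, round(predicted_stanine(s, 0.70), 1)) for s in range(9, 0, -1)]
for aptitude, achievement in table:
    print(aptitude, achievement)
```

Substituting a smaller or larger value of r shows how the regression toward the mean becomes more or less pronounced; with r = 1.00 the predicted stanine equals the aptitude stanine.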

Findley has made a suggestion for interpreting data for schools that 
would not only take into account this regression effect but many other 
factors that cause a group's achievement to be initially lower or higher, 
such as the low level of motivation among children who are academically 
retarded or come from homes of low cultural background. He suggests that 
we compare the gains of schools with similar initial median scores in a 
subject field. 


Factors of native ability, subcultural motivation and general instructional
effectiveness to date have combined to produce initial medians at the grade
level in question. It may be fairly assumed that the same factors will continue
to operate in similar measure thereafter except as growth data show the classes
in a school to have improved more than others with a similar start. [Italics
added]²⁰


18 Walter N. Durost and George A. Prescott, Essentials of Measurement for
Teachers (New York: Harcourt, Brace & World, Inc., 1962), p. 87.

19 We can easily check the results in this little table by translating stanine scores
into deviations from the mean stanine of 5, and using the simple standard-score
regression equation given on page 79. (This equation can be used because sta-
nines are linear transformations of z-scores.) Since a stanine score of 9 is a devi-
ation score of 4, we can substitute this value of 4 and our r of .70 in the standard-
score regression equation to obtain: Predicted stanine deviation (for achievement)
= (.70) (4) = 2.8. This deviation value is added to the mean of 5 to obtain the
first value in the table. All others can be checked in the same way, or the regression
effects occurring when r has a smaller or larger value can be obtained.


Table 13.5
Three-Year Gain Scores in Arithmetic (1956-1959) for Schools with
Various Median Scores When Tested at Grade 3.1 in Fall 1956


THREE-YEAR GAINS         Initial Median Grade Placements
(IN YEARS)           1.5-1.9   2.0-2.4   2.5-2.9   3.0-3.4

4.0-4.4                                      1         1
3.5-3.9                            1         8         6
3.0-3.4                            3        20         5
2.5-2.9                 1          1        15         1
2.0-2.4                 3          3         4
1.5-1.9                 7          7         1
1.0-1.4                 3          2

Median gain            1.7        1.9       3.1       3.5


Source: Adapted from Warren G. Findley, "Gains vs. Status Scores as Evidence of Effective-
ness of Instruction," The 19th Yearbook, National Council on Measurement in Education (Ames,
Iowa: The Council, 1962), p. 19.


In Table 13.5, the median three-year gains in arithmetic are shown for
schools with a similar start. All data are for schools initially tested in the
fall of grade 3. It is difficult to conceive a fairer basis for interpreting
data for schools. Such a procedure would allow each school to see how it
compared with other schools with a similar start, without revealing the
specific identities of the schools being compared. For example, in Table
13.5, a school with an initial median arithmetic score in the 1.5-1.9
group would know that its gain was above average for similar schools
if it had progressed at least two years during the three-year period
studied. It would probably not be difficult for publishers to provide such
data on one-year or two-year gains for a representative sampling of schools.
Selective inmigration into, or outmigration from, the school neighborhood
would still have to be taken into account in interpreting data concerning
gains for individual schools.
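
The median gains in Table 13.5 can be recomputed from the frequencies with the usual grouped-data interpolation. The sketch below is illustrative only: the assignment of each count to a column is our reading of the table, and the function and variable names are hypothetical.

```python
# Lower score limit of each half-year gain band (1.0-1.4, 1.5-1.9, ..., 4.0-4.4).
GAIN_BANDS = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

# Number of schools in each gain band, keyed by initial median grade placement.
# Column placement of the counts is an assumption made for this illustration.
counts_by_initial_median = {
    "1.5-1.9": [3, 7, 3, 1, 0, 0, 0],
    "2.0-2.4": [2, 7, 3, 1, 3, 1, 0],
    "2.5-2.9": [0, 1, 4, 15, 20, 8, 1],
    "3.0-3.4": [0, 0, 0, 1, 5, 6, 1],
}


def grouped_median(lower_limits, counts, width=0.5, unit=0.1):
    """Median of a grouped distribution, interpolating linearly within the median band."""
    half = sum(counts) / 2
    cumulative = 0
    for lower, freq in zip(lower_limits, counts):
        if freq and cumulative + freq >= half:
            real_lower = lower - unit / 2  # real lower limit, e.g. 1.45 for the 1.5-1.9 band
            return real_lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("empty distribution")


median_gains = {
    group: round(grouped_median(GAIN_BANDS, counts), 1)
    for group, counts in counts_by_initial_median.items()
}
print(median_gains)
```

Under this reading, a school that started in the 1.5-1.9 group and gained two years exceeds the median gain of 1.7 years for schools with a similar start, which agrees with the example in the text.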
20 Warren G. Findley, "Gain vs. Status Scores as Evidence of Effectiveness of
Instruction," 19th Yearbook, National Council on Measurement in Education
(Ames, Iowa: The Council, 1962), pp. 17-20.


Tables 13.1 and 13.2 illustrate the way in which item norms can be
used to discover how the local instructional program differs in emphasis
from the schools used in norming the test. It is doubtful, however, whether 
differences should be labeled by the terms "strengths" and “weaknesses” 
(as in Table 13.2). The differences merely represent reliable differences 
between local performance and “national” performance. It is only when 
these measurement data are screened against intended local emphases that 
an evaluation (or value judgment) can be made. 

It is desirable in any such curriculum study to have teachers agree in
advance concerning (1) test items on which they hope to exceed national 
item norms, because of special emphasis in their local program and (2) 
test items on which they could accept lower-than-average results without 
apology because they have intentionally minimized these understandings or 
skills in order to allow proportionally more instructional time for others 
which they deem more important. 
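
One way to picture this screening of item results against intended local emphases is the Python sketch below. Everything in it is hypothetical: the item data are invented, the three intent labels and the verdict wording are ours, and the 5-point `margin` merely stands in for whatever criterion of a reliable difference is adopted.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    item: int
    national_pct: float  # item norm: percent right in the norming sample
    local_pct: float     # percent right in the local district
    intent: str          # "emphasized", "minimized", or "typical" (agreed on in advance)


def screen(results, margin=5.0):
    """Label each item by comparing the local-minus-national difference with the intent."""
    verdicts = {}
    for r in results:
        diff = r.local_pct - r.national_pct
        if r.intent == "emphasized":
            # Teachers hoped to exceed the national item norm here.
            verdicts[r.item] = "goal met" if diff > 0 else "review"
        elif r.intent == "minimized":
            # A lower-than-average result was accepted in advance.
            verdicts[r.item] = "accepted" if diff < 0 else "no action"
        else:
            verdicts[r.item] = "review" if abs(diff) > margin else "no action"
    return verdicts


results = [
    ItemResult(1, 60, 72, "emphasized"),
    ItemResult(2, 70, 55, "minimized"),
    ItemResult(3, 65, 50, "typical"),
    ItemResult(4, 80, 78, "emphasized"),
]
verdicts = screen(results)
print(verdicts)
```

The point of the sketch is that the same 15-point deficit is "accepted" on item 2 but flagged on item 3: the measurement is identical, and only the prior agreement about emphasis turns it into a value judgment.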

It is also essential, in interpreting results of standardized achievement 
tests, to recognize that they measure only a portion of all educational out- 
comes. As will be emphasized in Chapter 15, evidence should be obtained 
concerning all important educational outcomes by the most dependable 
techniques available. If plans are not made for a comprehensive evaluation 
program, there is a real danger that the outcomes measured by stand- 
ardized achievement tests will be overvalued. 


LARGE-SCALE TESTING PROGRAMS 


Widespread concern has developed about the impact on students and 
school programs of the proliferation of large-scale testing programs.²¹
College-bound high school students, for example, may take college admis- 
sion tests, tests for national scholarship programs, tests administered by 
their own school districts, and sometimes additional tests given by a state 
department as a basis for “quality control.” 

One concern has to do with the amount of time devoted to such pro- 
grams. Certainly, every effort should be made to use students’ time to 
advantage and to avoid needless duplication of testing efforts. However, 
many national testing programs, such as the CEEB tests, are administered 
on Saturday; the testing time does not interfere with the regular school 
program. Moreover, these tests are optional with the student. 

Another concern has to do with student tension concerning examina-
tions. Tension can be reduced to some degree when materials are presented
to students in advance of testing that explain the purposes of a testing pro-
gram, give illustrative examples of test items, and indicate how the test
results will be reported to them and their schools.²²


21 An example of criticism in a news magazine is "A Rash of Testing in the
Schools. Is It Being Overdone?" United States News and World Report, vol. 46
(June 15, 1959), pp. 44-46.

Another concern is that large-scale testing programs may have adverse 
effects on the instructional program if teachers focus their attention on 
coaching for the test. If city-wide or state-wide programs of testing focus 
narrowly on specific learnings, and comparisons are made among classes
and schools, this type of deleterious effect is almost inevitable. During the 
early emphasis on survey testing such effects occurred and caused many 
teachers, as well as many leaders in curriculum and supervision, to feel 
antagonistic toward any type of achievement testing that did not originate 
with the teacher. Certainly tests used on a large-scale basis should stress 
the most important goals of education and allow for flexibility and variety 
in the means by which these goals are achieved. 

Another hazard of large-scale testing programs is that of using the scores 
routinely in making decisions about people.²³ In the selection, placement,
and guidance functions of the school, teachers and counselors must use 
all the relevant data that we have about students and, when the student 
has sufficient maturity, help him in the interpretation of all the data 
relevant to choices he wishes to make. 

All of these hazards seem to be ones that can be avoided by the proper 
planning of testing programs and appropriate use of test results. 


SUMMARY STATEMENT 


Standardized achievement tests were widely used only after World War I.
Studies revealing great subjectivity and unreliability in teachers' marks stimu-
lated the construction of objective, standardized tests in many curriculum
areas. Comprehensive batteries of achievement tests were developed with which
students' comparative achievements in such areas as reading, arithmetic, spell-
ing, and language usage could be determined and could be compared with the
achievement of other students in their own grade.

The first standardized objective tests were used extensively in city and state
surveys. Publication of survey results led to recognition of the wide range of
individual differences among students of a single grade level and aroused the
teaching profession to the need for adapting instruction to individual differ-
ences. The uncritical use of survey test results, however, led in some instances
to an overemphasis on the measured outcomes of instruction and a consequent
narrowing of the educational program. Curriculum changes and the child-study
approach led to the development of a broader and richer conception of evalua-
tion activities, including the appraisal of understandings, appreciations, atti-
tudes, and interests.


22 An excellent example of such orientation material is A Description of the
College Board Achievement Tests (Princeton, N. J.: Educational Testing Service,
1962).

23 A helpful bulletin on this topic is William C. Daly, "Test Scores: Fragments
of a Picture," Test Service Notebook No. 24 (New York: Harcourt, Brace & World,
Inc., 1959).

Standardized achievement tests are now used in most school systems to serve
a variety of purposes. The achievement tests considered for use must be ap-
praised in terms of their value for each of the major purposes for which they
will be used.

Each of the most widely used achievement test batteries was briefly described 
with special attention to organization of test content and any unique features 
that have been developed. Two series of end-of-course achievement tests for 
use in high school subjects were also described. Recommendations were made 
concerning the ways in which standardized end-of-course examinations might 
be used with due respect for the latitude professional teachers should have in 
adapting instructional programs to the needs of specific classes. The values 
and the problems associated with large-scale testing programs were briefly 
considered. 




DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 
1. When did standardized achievement tests begin to flourish in America?
What factors gave impetus to the movement? What factors led to a reaction
against standardized testing by many supervisors and teachers?




2. Interview the director of testing of a school system that has an organized 
achievement testing program. Describe the program in terms of the types of 
tests used, the frequency of their administration, and the specific uses of test 
results. 

3. List a number of reasons why standardized tests are used less frequently 
in social studies than in the basic skills. What other types of evaluation tech- 
niques are used to appraise student growth toward the goals of the social 
studies? 

4. Compare and evaluate the subtests on social studies or science in two or 
more achievement test batteries. 

5. Describe and evaluate two or more standardized achievement tests in your 
major subject field. Consult the reviews in Buros' Mental Measurements
Yearbooks. 

6. Discuss the values and limitations of such item analyses as are presented 
in Tables 13.1 and 13.2 in helping a teacher to identify points of possible over-
emphasis or underemphasis in his instructional program.

7. Discuss the relative advantages and disadvantages of the three possible 
policies with respect to the range of difficulty of items in a standardized test,
as discussed in this chapter.


14 Educational Diagnosis 


If individualized instruction is to attain maximum effectiveness, it must 
be based on educational diagnosis. Although the term “diagnostic study” 
implies a detailed study of the learning difficulties of an individual student, 
the more general term “educational diagnosis” includes within its scope 
all activities in measurement and interpretation that help to identify growth 
lags and their causal factors for individuals or for class groups. 

Five levels of diagnosis have been identified by Ross and Stanley. Of
these, three are emphasized in this chapter: (1) Who are the pupils having
trouble? (2) Where are the errors located? and (3) Why did the errors
occur? They list two still higher levels in the process of educational diag-
nosis: (4) What remedies are suggested? and (5) How can the errors be
prevented?¹ The first four levels are concerned with corrective diagnosis;
the fifth with preventive diagnosis: the discovery and modification of pre-
ventable factors that are within the control of the school.


MEASUREMENT AS BASIC TO EDUCATIONAL DIAGNOSIS 
AND INDIVIDUALIZED INSTRUCTION 


The following tenets indicate the author's point of view regarding the 
importance of educational diagnosis and individualized instruction, as well 
as the place of measurement in these aspects of education. These tenets 
serve not only as an introduction to our study of diagnosis but summarize 
the principles implicit in Part III on the use of measurement in the
improvement of instruction. 


1 C. C. Ross and Julian C. Stanley, Measurement in Today's Schools, third ed.
(Englewood Cliffs, N. J.: Prentice-Hall, Inc., 1954), p. 332.




Interindividual Differences in Ability and Achievement Pose Crucial 
Problems for the Teacher 


A teacher soon recognizes that the students in his class differ widely
in most of the characteristics related to learning. Some have little or no
interest in the instructional activities of the class, whereas others have
an intense desire to learn. Some have a very meager vocabulary and a
limited experiential background, whereas others have had rich cultural
experiences in the home, in the community, and through extensive travel.
Some have physical handicaps, poor health, impaired vision and hearing,
which hinder success in learning; others are vitally alert and well equipped
to attain easy success, both within and outside the classroom. The teacher
is constantly aware that some students learn easily, whereas others find the
work difficult, respond slowly to instruction, and require much practice.


Intraindividual Differences Must Also Be Studied 


Knowledge of intraindividual differences is usually even more significant
for teaching than knowledge of interindividual differences. A student, for
example, may be below average for his grade in reading, markedly above
average in arithmetic computation, and approximately at grade average in
other subjects. It is not unusual for a student to show a range of three to
four years in his achievement in the various subtests of an achievement
battery. In addition to individual differences in achievement, there are
often highly significant differences in personality traits, cultural back-
ground, and interests which affect students' educational needs.


Studying and Understanding Students Are Essential to Good Teaching 


Materials and techniques of instruction, if they are to be effective, must
be geared to the learning needs of students. These needs are related to
students' aptitudes, interests, physical handicaps, and adjustment problems,
as well as to their specific learning difficulties. Early recognition of a problem
reduces the need for corrective work and prevents the development of
more serious problems.

Teachers must become sensitive, therefore, to student behavior that
reveals lack of interest or feelings of discouragement and frustration. Such
behavior must be recognized as an expression of need, as symptomatic
of problems that must be identified before the student can make normal
progress in academic work. Understanding the causal factors contributing
to a student's problem is essential to planning a corrective program. Study-
ing all students in order to understand their problems is a fundamental
aspect of effective teaching.




Standardized Tests Are Important Tools to Aid Teachers in 
Understanding Students 


For students who are retarded academically, the instructional program 
usually requires some modification. Results of standardized tests assist in 
determining the optimal difficulty of instructional materials to be used. 
Such students frequently have failed to master all the essential skills 
taught at preceding grade levels. Test results can aid in analyzing their 
difficulties and in ascertaining which skills, information, and concepts 
they have failed to acquire. 

Standardized tests can provide data that are more objective, more 
precise, and more analytical than those that result from unaided teacher 
judgment. Although teachers can usually identify students who are good 
and poor achievers, they are usually unable to identify the nature of the 
strengths and weaknesses in individuals within a subject field.


Standardized Tests Need To Be Supplemented by Other Techniques 
of Studying Students 


In the diagnosis of the causal factors underlying poor academic achieve- 
ment, as well as in the study of such problems as tenseness or predelin- 
quent behavior, the results of standardized tests must be supplemented by 
extensive data relating to the student's attitudes and feelings. 

The informal methods of child study, discussed in Chapter 8, can aid 
the teacher in developing a more adequate understanding of each student. 
These techniques involve direct study of the spontaneous behavior of the 
student in the many interrelationships of his everyday life. The teacher, 
however, must be aware of the limitations as well as the advantages of 
these subjective methods. Adequate training in, and discriminating use 
of, informal methods can make the teacher increasingly aware of recurring 
patterns of student behavior and their significance as an expression of 
the student’s needs and problems. 


Data Derived from Studying Students Must Be Interpreted Objectively 


In recording and summarizing data obtained through the more informal
methods, the teacher must make certain that his interpretations are free
from prejudice and personal bias. The trained observer knows how to use
the scientific method in organizing and interpreting data. He makes tenta-
tive hypotheses about possible causes of recurring behavior, and he
examines data that might support or refute these hypotheses. He recog-
nizes that conclusions are justified only when evidence from different
methods converges or tends to support the same hypothesis.


2 John C. Flanagan, "The Critical Incident Technique in the Study of Individuals,"
Modern Educational Problems, Report of the Seventeenth Educational Conference
(Washington, D.C.: American Council on Education, October 30-31, 1952).


Measurement, Evaluation, and Individualized Instruction Are 
Interrelated Aspects of Effective Teaching 


Evaluative judgments that are not based on reliable measurement, or
the convergence of data obtained by different informal methods, may be
incorrect and misleading. The significance of small differences in test
scores may be overestimated, or wishful thinking may replace objective
appraisal. On the other hand, measurement that does not lead to evalua-
tive judgments and to wiser decisions is of little value.

Measurement of the strengths and weaknesses of a class, and of indi-
vidual students, is essential as a basis for planning each step in an instruc-
tional program. Such a study reveals the need for providing instructional
materials that are differentiated with respect to difficulty and content; it
also points out specific needs for corrective teaching. Instruction that is not
individualized to meet the needs revealed by measurement tends to be
ineffective and frustrating to teacher and students alike. Measurement,
evaluation, and individualized instruction are therefore interrelated com-
ponents of effective teaching.


The School Must Accept Responsibility for Developing Every Student 
to the Maximum of His Potentialities


To teachers who accept responsibility for helping all students develop 
to the limit of their potentialities, differences among individuals (and
among traits in the same individual) constitute a challenge. Through study 
and experience, these teachers gradually develop the understandings and 
techniques necessary for studying the needs of students and planning 
individualized instructional activities. Group instruction is utilized for
subgroups within the class that are reasonably homogeneous in their ability
to profit by such experiences. Individualized instruction is provided for
students who are deficient in the skills; and special classes or other modi-
fications in program are provided for those who show serious growth lags.

The effectiveness of programmed instruction with retarded and slow-
learning students has demonstrated the need for greater individualization
of instruction.


462 THE IMPROVEMENT OF INSTRUCTION 


At the secondary school level, the responsibility for individualizing the 
educational program is shared by guidance workers and teachers. The 
guidance workers help the student to select the courses, extracurricular 
activities, and work experiences that will be most valuable to him. Even 
if the counseling program is adequate, the classroom teachers must still 
teach fairly heterogeneous groups of students with a wide variety of inter- 
ests and needs. The high school teacher's greater student-contact load, 
and the shorter length of time he has students in class, make individualiza- 
tion of instruction more difficult at the secondary school level. The sec- 
ondary school teacher, however, can use plans and materials that capitalize 
on the greater maturity of the high school student and his increased ability 
to direct his own activities. 

A small growth lag that is corrected early will not develop into a critical
disability in later grades, when deficiencies in learning may be complicated 
by attitudes of defeatism and negativism on the part of the student, and 
when such deficiencies may damage the student's relationships with parents 
and classmates. Early identification of learning difficulties, followed by 
individualized instruction, constitutes economical and effective preventive
work and reduces the need for later remedial work. 


LEVELS OF DIAGNOSIS 


Faced with the obvious fact that he cannot do comprehensive diagnostic 
work with each student, the teacher is rightly concerned with determining 
the level of diagnosis he should attempt in specific situations. 

The term "level of diagnosis" cannot be defined precisely. An illustra- 
tion, however, may help to clarify the concept. A teacher may administer 
an achievement test battery and, on the basis of the results, note that a 
student's difficulty is in arithmetic computation, rather than in arithmetic 
reasoning, reading, language, or other school subjects. The diagnosis may 
be carried to another level by determining in which of the arithmetic proc- 
esses he is weak (for example, division). Diagnosis may be carried to a 
still more specific degree by discovering that the student does not know 
how to estimate trial divisors, although he does have an adequate familiarity 
with the division combinations. 

No rule can be established as to the level of diagnosis that is appro- 
priate in a specific situation. In general, it may be said, however, that a 
satisfactory level of diagnosis has been reached when the teacher has 
gained sufficient insight into the nature of the student's problem to enable 
him to plan appropriate corrective instruction. This will be determined in 
large part by the complexity of the individual problem. 

As Tyler has said, “A satisfactory diagnosis should be as specific as the




desired outcomes permit and as the possibility of localization of symptoms 
allow, so long as the diagnosis is practicable. It need not be carried farther 
than is appropriate for the remedial program provided."* 


STEPS IN EDUCATIONAL DIAGNOSIS 


The essential steps in educational diagnosis are (1) identifying the stu- 
dents who are having trouble, (2) locating the errors or learning difficul- 
ties, and (3) discovering the causal factors. In the following discussion 
of these three steps in educational diagnosis, the illustrative material will 
be concerned chiefly with diagnosis in the basic skills. This section is 
followed by a summary of suggestions on group diagnosis and other pro- 
cedures that are practicable for teachers of high school content subjects. 


Identifying the Students Who Need Help 


The inclusion of this step is based on a recognition of the realities of 
the typical teaching situation. All students can profit from individualized 
help, given on the basis of educational diagnosis. In the typical classroom, 
however, individualized corrective instruction can be given to only a few 
students. Hence, the teacher's first step is to identify the students in each
subject area for whom diagnosis and corrective instruction are imperative
in order for them to participate in group instruction with profit.

Although students who are in the greatest need of corrective instruction 
can be identified by scanning achievement test profiles, data on scholastic
aptitude, and other relevant data, the following procedures will be found
to be systematic and objective:


1. Plot each student's test (or subtest) score and some measure of his expected 
achievement on a two-variable chart, similar to Figures 14.1 or 14.2. 

2. Draw a "staircase" diagonal from the lower-left- to the upper-right-hand
corner of the chart, which includes at each level the cell or square for
students whose converted scores on the intelligence and achievement tests
correspond (fall in the same interval), as well as the adjoining cells for
those whose scores differ by only one interval.

3. Note the names of all students whose scores are outside the "staircase"
diagonal. Those students whose tallies are to the left of the staircase appear


*Ralph W. Tyler, “Characteristics of a Satisfactory Diagnosis,” Educational
Diagnosis, 34th Yearbook, National Society for the Study of Education (Chicago:
University of Chicago Press, 1935), p. 106.

4 If carefully selected programmed materials have been made available,
individualized instruction for all may be possible.




to be “underachieving”; those to the right appear to be “overachieving.”
These two terms are tentative classifications, applied to students whose
achievement scores are significantly lower, or higher, than measures of their
scholastic aptitude.

4. For all students who appear to be overachieving, check to see whether the
student is overachieving in other achievement tests. If he is, examine data 
on previous scholastic aptitude tests on his cumulative record. The most 
recent test of scholastic aptitude may have given him a lower IQ than he 
obtained on previous tests. If he is a new student, with no previous intelli- 
gence test data available, a retest should probably be given. 

5. For all students who appear to be underachieving, further diagnostic work 
should be undertaken. More information on these individuals should be 
obtained as a basis for developing (a) hypotheses regarding the reasons for 
their low achievement and (b) suggestions for helping them toward more 
adequate achievement. Diagnostic study and corrective instruction are 
especially important if these students are below national norm (which repre- 
sents society's standards for that grade level) or if their level of achievement 
is likely to handicap them in the pursuit of their educational and vocational 
goals. On the other hand, a brilliant student planning to major in literature 
would not be considered as needing diagnostic study if his stanine in intelli- 
gence were 8 or 9, and his score in mathematics or science were 2 stanines 
lower. 
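
The screening rule in steps 2 through 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_by_stanine` and the sample stanines are hypothetical and are not read from Figure 14.1. The one rule taken from the text is that scores matching or differing by a single stanine fall inside the "staircase," while larger gaps are flagged tentatively.

```python
def classify_by_stanine(aptitude_stanine, achievement_stanine):
    """Tentatively classify one student from two stanine scores (1-9).

    Scores that match or differ by only one stanine fall inside the
    "staircase"; larger gaps are flagged for follow-up study, not as
    final labels.
    """
    for stanine in (aptitude_stanine, achievement_stanine):
        if not 1 <= stanine <= 9:
            raise ValueError("stanines run from 1 to 9")
    gap = achievement_stanine - aptitude_stanine
    if gap <= -2:
        return "appears to be underachieving"
    if gap >= 2:
        return "appears to be overachieving"
    return "within the staircase"


# Hypothetical students: (aptitude stanine, achievement stanine).
sample = {"Student A": (6, 3), "Student B": (4, 5), "Student C": (3, 6)}
for name, (aptitude, achievement) in sample.items():
    print(name, "-", classify_by_stanine(aptitude, achievement))
```

A flag from such a routine is only a starting point; as steps 4 and 5 indicate, the cumulative record and further diagnostic study decide what it means.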


USING LOCALLY DEVELOPED NORMS ON A TWO-VARIABLE CHART Figure 
14.1 illustrates a type of two-variable chart that can be used with any 
standardized test that has stanine scores, for example, the Metropolitan 
Achievement Tests. Stanine scores represent equal units of ½ SD;
moreover, stanine norms are easily developed from local data.
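
Because stanines are half-standard-deviation units centered on 5, local stanine norms can be approximated directly from a set of local scores. The sketch below is only an illustration under stated assumptions: it assumes the local scores are roughly normally distributed, the function names and sample scores are hypothetical, and published stanine tables are usually built from percentage bands rather than from z-scores.

```python
import math
import statistics


def stanine_from_z(z):
    """Map a z-score to a stanine (1-9).

    Each stanine is half an SD wide; stanine 5 covers z-scores from
    -0.25 up to (but not including) +0.25. Extreme scores are clipped.
    """
    return max(1, min(9, math.floor(2 * z + 5.5)))


def local_stanines(scores):
    """Convert raw scores to stanines using the local mean and SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("all scores are identical; stanines are undefined")
    return [stanine_from_z((score - mean) / sd) for score in scores]


raw_scores = [10, 20, 30, 40, 50]  # hypothetical local raw scores
print(local_stanines(raw_scores))
```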

Another unusual feature of Figure 14.1 is that data for a single class 
are superimposed on a two-way table for a school. The teacher has written 
in the names of the students in his class on a mimeographed table pro- 
viding data for the school; such a table could be provided for a school 
district or larger unit. In this way, the teacher can see how many of his 
students are working higher or lower than expectancy (in terms of scho- 
lastic aptitude); and he can also compare the distribution of scores for his 
class with those for the school or school district. 

For example, in Figure 14.1, only 2, or 6 percent, of the students in 
Miss Lee's class appear to be overachievers, as compared with 22 percent 
in the school. On the other hand, 48 percent of Miss Lee's class appear to 
be underachievers, as compared with 21 percent in the school. 

Without further information, we do not know whether Miss Lee's stu- 
dents were grouped into a low-achiever class, whether the students have 
made inadequate progress in eighth grade, or both. We can easily determine 
that the median stanine on IQ is 4, while the median in arithmetic is one
full point lower. We would especially like to have more information about 
the seven students for whom the difference between achievement and 
expectancy is very large, that is, three or more stanines. 




[Figure 14.1: two-variable chart with pupils' names entered in the cells. Axes: Stanines on Local Arithmetic Test; Stanines on Pintner General Ability Test: Intermediate Test.]

Fig. 14.1 Names of 33 Pupils in Miss Lee's Eighth-grade Arithmetic
Class, Entered on a Two-Variable Chart for All Eighth-graders in
Central High School.
This chart is designed to show the relationship between stanine scores on the Pintner
General Ability Test and stanine scores on the local arithmetic test. (The r between these
two tests is approximately .52.) All cases within the diagonal "staircase" have the same
stanine scores in both the intelligence and arithmetic tests, or their stanines in these two
tests differ by only one score. For all cases to the right of the diagonal "staircase" (in
this case 23 percent of Central High School students), the student's arithmetic stanine
exceeded his PGAT stanine by two or more points. For all cases to the left (in this case
21 percent), the student's arithmetic stanine was two or more points below his PGAT
stanine.


USING THE AAGP AS A MEASURE OF EXPECTANCY In studying the results
for subtests of the California Achievement Tests, an especially designed
measure of expectancy (the Anticipated Achievement Grade
Placement) can be used. One can compare each student's achievement
with his AAGP, that is, the average grade placement earned on that sub-
test by students of his grade, age, and mental age. Table 14.1 lists reading
test results for 50 students and indicates the difference between each stu- 




dent's reading comprehension GP and his AAGP. Those differences 
smaller than one SE_difference are placed in parentheses. Asterisks are used
to indicate the level of assurance with which we can interpret other differ-
ences as being greater than chance differences, or unlikely to be reversed
on retesting.


Table 14.1
Anticipated Achievement Grade Placements and Grade Placements in
Reading Comprehensionᵃ for a Class of Ninth-Grade Students

                    ANTICIPATED
                    ACHIEVEMENT        TOTAL READING
STUDENT             GRADE PLACEMENTᵇ   GRADE PLACEMENT   DIFFERENCEᶜ

 1. Albert               10.1               8.9
 2. Alfred                9.8               7.3
 3. Adele                 8.8               5.0
 4. Arnold               11.3               9.1
 5. Arleen                8.6               8.5
 6. Audrey               11.2              10.0
 7. Brian                10.2              10.3
 8. Byron                 9.3               8.9
 9. Carol                 6.4               6.9
10. Colleen               9.2               6.7
11. Curtis                9.2               9.2
12. Dale                  9.6               7.7
13. Dean                  6.9               5.5
14. Diana                 8.5               9.1
15. Donna                10.3              10.5
16. Dorothy              10.5              10.6
17. Douglas               7.7               9.0
18. Edgar                 7.2               7.2
19. Elaine            [illegible]           8.8
20. Eugene                7.6               8.0
21. Floyd                 5.8               6.6
22. Freda                 9.7              10.7             +1.0
23. Gladys                8.0               6.7             -1.3*
24. Grace                 6.4               6.1            (-0.3)
25. Guy                   6.9               7.9             +1.0
26. Harvey               10.4              10.3            (-0.1)
27. Hazel                 9.6               6.6          [illegible]
28. Helen                 6.9               8.0
29. James                 7.5               7.5
30. John                 10.6              10.6             (0.0)
31. Janet                10.8              10.7            (-0.1)
32. Joyce                 8.4               9.7             +1.3*
33. June                  5.9               6.9             +1.0
34. Leroy                 7.1               7.0            (-0.1)
35. Louis                 8.8              10.0             +1.2
36. Mabel                 8.9               6.9             -2.0**
37. Mary                  7.3               9.5
38. Nancy                 9.3               8.9
39. Phillip               9.7               8.3
40. Ralph                10.1               8.4
41. Robert               10.1              10.3
42. Russell               9.7               8.8
43. Sarah                10.9              10.7
44. Susan                 8.5               7.5
45. Thelma                8.3               7.4
46. Walter               11.4              11.0
47. Warren                8.3               6.1
48. Wendell               8.3               7.4
49. Wilma                10.9               9.6
50. Winifred              8.8               5.0

    Total               436.8             416.1
    Mean                  8.7               8.3

ᵃ As measured by the California Reading Tests, Junior High Level, administered during the
first month of the school year (Monterey, Calif.: California Test Bureau, 1957).

ᵇ Obtained from special tables for the Anticipated Achievement Grade Placement. (These
are individualized expectancy norms for the California Achievement Tests.) For example,
Albert's AAGP of 10.1 indicates that this is the average grade placement in reading com-
prehension for students in the norming sample in the same grade, with the same age and the
same mental age as Albert.

ᶜ Differences smaller than the SE_difference (.78) have been enclosed in parentheses. One
asterisk indicates a difference large enough to be significant at the 10 percent level, that is,
a difference large enough to occur in only 10 percent of the sample testings if the true differ-
ence were zero. In other words, the odds are only 10 percent, or one out of ten, that the
difference is due to chance and would not be found on retesting. Two asterisks indicate a
difference large enough to be significant at the 5 percent level, with an even lower probability
of representing a chance difference.
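
The marking scheme described in the table note can be sketched as follows. This is a hedged illustration, not the publisher's procedure: the SE of .78 comes from the note, but the normal-curve multipliers (1.645 and 1.960, two-tailed) and the function name `flag_difference` are assumptions introduced here. The legible entries in the DIFFERENCE column are consistent with these cutoffs.

```python
SE_DIFFERENCE = 0.78   # standard error of the difference, from the table note

# Assumed two-tailed normal-curve multipliers (not stated in the source).
Z_10_PERCENT = 1.645
Z_5_PERCENT = 1.960


def flag_difference(aagp, reading_gp, se=SE_DIFFERENCE):
    """Format reading GP minus AAGP with the table's parentheses and asterisks."""
    diff = round(reading_gp - aagp, 1)
    size = abs(diff)
    text = "0.0" if size == 0 else f"{diff:+.1f}"
    if size < se:
        return f"({text})"          # smaller than one SE of the difference
    if size >= Z_5_PERCENT * se:
        return text + "**"          # significant at the 5 percent level
    if size >= Z_10_PERCENT * se:
        return text + "*"           # significant at the 10 percent level
    return text


# Rows taken from Table 14.1: (AAGP, reading grade placement).
for name, aagp, reading in [("Grace", 6.4, 6.1), ("Freda", 9.7, 10.7),
                            ("Joyce", 8.4, 9.7), ("Mabel", 8.9, 6.9)]:
    print(name, flag_difference(aagp, reading))
```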


In constructing the scatter diagram in Figure 14.2, each student's iden-
tification or code number was entered in the square that located him with
respect to both his AAGP (Anticipated Achievement Grade Placement)
and his RCGP (Reading Comprehension Grade Placement). For example,
Student 13 (Dean) had an AAGP of 6.9 and an RCGP of 5.5; hence his
code number (13) was entered in the square for an AAGP of 6.5-6.9 and
an RCGP of 5.5-5.9 (that is, in the square formed by the intersection
of the row and column in which pupils of Dean's ability level are
recorded).
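
Locating a student's square amounts to grouping each grade placement into a half-year interval. The sketch below assumes grade placements are reported to one decimal place; the function names are hypothetical, and the only data used are Dean's two scores from the example above.

```python
def half_year_interval(grade_placement):
    """Return the half-year class interval (e.g. '6.5-6.9') for a grade placement."""
    tenths = round(grade_placement * 10)   # work in whole tenths to avoid float drift
    low = tenths // 5 * 5                  # lower limit of the interval, in tenths
    return f"{low / 10:.1f}-{(low + 4) / 10:.1f}"


def chart_cell(aagp, rcgp):
    """Return the (AAGP interval, RCGP interval) pair that locates one student."""
    return half_year_interval(aagp), half_year_interval(rcgp)


# Student 13 (Dean): AAGP 6.9, RCGP 5.5.
print(chart_cell(6.9, 5.5))
```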




[Figure 14.2: scatter diagram of student code numbers. Axes: Total Reading Grade Placement; Anticipated Achievement Grade Placement.]

Fig. 14.2 Scatter-Diagram of Anticipated Achievement Grade
Placements and Total Reading Grade Placements for 50 Ninth-grade
Students on the California Reading Test: Intermediate (Monterey, Calif.:
California Test Bureau, 1957).


To aid in the interpretation of results, a pair of heavy horizontal lines 
has been drawn to enclose the class interval containing the mean (8.7) of 
the AAGP's. All students whose code numbers appear above this line are 
above average in expected achievement. The pair of heavy vertical lines
encloses the class interval containing the mean reading comprehension grade
placement of 8.3. All students with reading comprehension grade place- 
ments to the right of this line have achieved above the group average, 
whereas those to the left are below the group average.* All students whose 
code numbers appear in the intervals to the left of 9.0 are below the na- 
tional norm. Students in the upper left hand quarter of the scatter diagram 
are above average with respect to expectancy but below average with 
respect to achievement. 


The "staircase" diagonal includes the code numbers of students who are 


5 Since all data are grouped by intervals, students whose scores are within the 
interval in which the mean lies are considered to be at the average level, rather 
than either above or below. 




working "approximately at expectancy level." It will be noted that in each 
case this channel includes the square for students whose AAGP and RCGP 
correspond (fall in the same interval) plus the adjoining intervals, repre- 
senting a deviation of one half year in either direction. In this group of 
50 pupils, 20 are in this center channel and can therefore be described as 
reading “approximately at expectancy level”; 9 are to the right of the 
channel, indicating that their reading is "above expectancy"; and 21 are 
to the left of the channel, or reading "below expectancy."
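
The channel rule just described (same interval, or an adjoining interval one half year away) can be sketched as a small routine. The function names are hypothetical, and one-decimal grade placements are assumed; the three students used are rows from Table 14.1.

```python
def interval_index(grade_placement):
    """Number the half-year intervals: 6.5-6.9 is 13, 7.0-7.4 is 14, and so on."""
    return round(grade_placement * 10) // 5


def expectancy_status(aagp, rcgp):
    """Classify a student relative to the "staircase" channel."""
    shift = interval_index(rcgp) - interval_index(aagp)
    if shift < -1:
        return "below expectancy"
    if shift > 1:
        return "above expectancy"
    return "approximately at expectancy level"


# (AAGP, RCGP) pairs taken from Table 14.1.
students = {"Dean": (6.9, 5.5), "Carol": (6.4, 6.9), "Joyce": (8.4, 9.7)}
for name, (aagp, rcgp) in students.items():
    print(name, "-", expectancy_status(aagp, rcgp))
```

Because the rule works on grouped intervals, two students with the same numerical difference can land on opposite sides of the channel boundary, which is one more reason to treat the classifications as tentative.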

Among those who appear to be underachievers (those reading below 
expectancy), Alfred and Winifred have been chosen as examples. With 
AAGP’s of 9.8 and 8.8, respectively, Alfred and Winifred have reading 
grade placements of 7.3 and 5.0, respectively. Obviously, these two stu- 
dents, in addition to many others in this group, need diagnosis and probably 
need corrective instruction in reading. Those students who are reading 
significantly below both the national norm and their own ability level can 
also be easily identified from this scatter diagram. These students can profit 
from remedial instruction much more than such students as Carol, who, 
although she reads well below national norm, is achieving up to expectancy.⁶
In interpreting an expectancy chart, the teacher should bear in mind that 
the data on both ability and achievement reflect errors in measurement. 


USING EXPECTANCY CHARTS IN HIGH SCHOOL SUBJECTS By means of
specially designed expectancy charts,⁷ it is possible to compare students’
achievement in various high school subjects with their ability level. In
Figure 14.3, the code number for each student is plotted in the column
corresponding to his deviation IQ on the Otis, Pintner, or Terman-
McNemar tests, and at a level in that column that corresponds to his
achievement on the algebra test. For example, student 12 had an IQ of
88 on the Terman-McNemar test and a standard score of 117 in algebra
(which is equivalent to a percentile rank of 80, as shown in the column
to the right). When his code number is entered on the expectancy chart,
it is seen that he would rank at the 98th percentile in comparison with
students of his own ability level. In fact, as with Mary in the ninth-grade
class (Fig. 14.2), it seems advisable to obtain additional data to determine
whether the intelligence quotient of this student has been underestimated
by his performance on the Terman-McNemar test, used in plotting this
chart.


6 In making this analysis, it is assumed that the intelligence test used in obtaining
the AAGP has provided a valid measure of the student's mental age. If the student's
reading handicap is sufficient to have invalidated his score on the test, administra-
tion of a nonlanguage intelligence test may be advisable.

7 Expectancy charts are available for tests of the Evaluation and Adjustment
Series (a series of survey tests in the major high school subjects published by Har-
court, Brace & World, Inc.).




[Figure 14.3: expectancy chart. Column headings: deviation IQ intervals. Other labels on the chart include level bands (Very High, High, Average) and a Basic Data panel (N = 300).]

Fig. 14.3 Test Results for a Class Completing First-year Algebra, Charted
on the Lankton First-Year Algebra Expectancy Chart. Achievement test results:
standard scores on the Lankton test; intelligence test results: deviation
IQ's on the Terman-McNemar Group Test of Mental Ability.

From Manual, Lankton First-Year Algebra Test. Copyright 1951 by Harcourt, Brace & World, Inc.
New York, N.Y. Copyright in Great Britain. All rights reserved. Reproduced by permission.


Students 1, 3, and 13 are all high-ability students whose achievement in 
algebra is even better than would be expected from persons of their learn- 
ing ability. They merit special encouragement to continue in mathematics, 
and the algebra teacher may wish to plan enrichment activities for them. 
Students 2 and 22 also have superior ability but are achieving far below 
their potential. A little time spent in diagnostic study of the reasons for 
their low achievement would probably bring good results, since these stu- 




dents should be able to make excellent progress with a minimum of teacher 
assistance. Those students with IQ's above 100 and achievement below
average for their ability level could profitably be assigned diagnostic testing 
material and given special help as a group with their common problems 
or learning difficulties. The advisability of continued instruction in algebra 
for students 4, 5, 6, and 10 might well be discussed with the high school 
counselor, who would have additional data on learning ability, motivation, 
and career plans. 


Locating the Errors or Learning Difficulties 


If a student has been identified on a standardized test as working below 
expectancy in some area, for example, arithmetic fundamentals, the 
teacher's next step depends upon the type of standardized test he has used. 
A few standardized tests (for example, the California Achievement Tests) 
provide subtest scores in addition, subtraction, multiplication, and division, 
which help the teacher to determine specific areas of difficulty for the 
student. By means of the Diagnostic Analysis printed in the test booklet,
the teacher can obtain still further leads concerning the types of error
made. If the Scoreze device has been used, he can easily identify the spe-
cific problems missed and, by referring to the printed descriptions, can
classify them by type.

Study of the Diagnostic Analysis and the Scoreze⁸ for an individual
student might suggest that he had mastered the combinations but had diffi-
culty with problems involving zeros, or that he consistently failed in prob-
lems requiring inversion of the divisor in the division of fractions. Such an
analysis, however, provides only clues or leads. The teacher must follow
up such leads by assigning similar problems and noting the student's per-
formance on a larger sampling of problems of the same type.

For students who are markedly retarded in the fundamental skills, a
more comprehensive diagnosis of learning difficulties is desirable as a basis
for a well-planned program of corrective instruction. For this purpose,
special diagnostic tests are available, designed to give evidence on students'
specific retraining needs.


8 For an illustration of the use of the Diagnostic Analysis and Scoreze for an
individual pupil, see Theodore L. Torgerson and Georgia Sachs Adams, Measure-
ment and Evaluation for the Elementary School Teacher (New York: Holt, Rine-
hart and Winston, Inc., 1954), pp. 206-208.

9 It is only in reading and arithmetic, however, that much work
has been done in the development of published diagnostic tests. A number of spe-
cific tests and techniques that can be used for diagnosis in the skills subjects are
presented in Chapters 11 through 14 of the authors’ textbook for elementary school
teachers and Chapters 13 through 16 of the authors’ textbook for secondary school



The subtests of a diagnostic test should measure component skills that 
are critical for success in the subject, for example, word attack skills in 
reading. Each subtest on a component skill should be loaded with oppor- 
tunities for one type of error and be as free as possible from other sources 
of learning difficulty. Ideally, each subtest should be long enough so that 
intraindividual differences between related abilities can be reliably meas- 
ured. The student will recall that when tests are highly correlated, the dif- 
ferences between scores tend to have low reliability. 
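
The point about correlated tests can be made concrete with the usual textbook formula for the reliability of a difference between two scores expressed on the same standard-score scale. The formula is supplied here as an assumption, since the passage does not state it, and the coefficients in the example are hypothetical:

    r_difference = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

```python
def reliability_of_difference(r_xx, r_yy, r_xy):
    """Reliability of the difference between two standard scores.

    r_xx and r_yy are the reliabilities of the two tests; r_xy is the
    correlation between them. Assumes both scores share the same scale
    (equal standard deviations).
    """
    if r_xy >= 1:
        raise ValueError("the correlation between the tests must be below 1")
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Two subtests, each with reliability .90 (hypothetical values):
for r_xy in (0.0, 0.5, 0.8):
    print(r_xy, round(reliability_of_difference(0.90, 0.90, r_xy), 2))
```

With these illustrative values the reliability of the difference falls from .90 to .80 to .50 as the correlation between the subtests rises from .00 to .50 to .80.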

Since a high level of reliability is seldom achieved in diagnostic tests, 
the user should interpret the results cautiously, making tentative hypotheses 
about learning difficulties and appropriate corrective instruction. The 
teacher must be willing to revise such hypotheses as additional information 
is obtained and must search for additional clues if corrective instruction 
proves ineffective.

Ideally, diagnostic tests should be directly related to the materials avail- 
able for corrective instruction. They should always be gauged to the 
achievement level of the retarded student, that is, with a large number of 
items at his achievement level. Norms are relatively unimportant, since the 
chief purpose in diagnostic testing is to discover what the student cannot 
do and why, rather than to compare his achievement with grade standards. 

The teacher who administers diagnostic tests to students with learning 
problems not only obtains data that aid the students being studied but
increases his own understanding of the learning process and of typical 
errors and confusions among retarded students. In other words, the teacher 
who uses and interprets diagnostic tests obtains valuable in-service training, 
which may not only improve his teaching and his selection of practice 
materials but may help him to interpret his everyday observations of stu- 
dents during supervised study periods and may change his correction of 
homework from a routine clerical task to one that has real diagnostic value. 


DIAGNOSING LEARNING DIFFICULTIES IN ARITHMETIC Ideally, diagnostic 
tests in arithmetic are keyed to general screening tests, as well as to indi- 
vidualized practice materials. For example, a teacher using the Brueckner 
series can administer one or more of the screening tests (available for
whole numbers, fractions, and decimals). Then, on the basis of results on 
these screening tests, he can select appropriate diagnostic tests from the 


teachers. T. L. Torgerson and Georgia Sachs Adams, Measurement and Evaluation 
for Elementary School Teachers (New York: Holt, Rinehart and Winston, Inc., 
1954); Georgia Sachs Adams and T. L. Torgerson, Measurement and Evaluation 
for Secondary School Teachers (New York: Holt, Rinehart and Winston, Inc., 
1956). 


10 Lee J. Brueckner, Diagnostic Tests and Self-Helps in Arithmetic (Monterey,
Calif.: California Test Bureau, 1955). 




23 such tests in the series. On the basis of the results from such individual- 
ized testing, students can be assigned to work on any of 23 sets of correc- 
tive "self-help" exercises, keyed to the diagnostic tests; and other suggested 
remedial procedures can be used. 

For some children, printed tests and self-help materials are not adequate; 
for example, a student may not have the reading ability and/or the motiva- 
tion level to work alone on such materials. For such children, it may be 
necessary to have the student work the problems aloud. In this way, one 
can study the process as well as the result. In observing students during 
supervised study, the teacher may note evidence of counting or other
roundabout methods of computation. 

A few moments of observation at a critical time in the introduction of a
new process may forestall later difficulties. Through studying the work of
retarded students on their regular assignments, the teacher may be able to
spot recurring errors and misconceptions.

Problem solving in arithmetic requires competence not only in compu- 
tational skills but in many other skills as well. A command of reading 
skills is essential in problem solving. Treacy found that students who were
poor in problem solving tended to be deficient in both general vocabulary
and arithmetic vocabulary, as well as in four subtests of the Diagnostic
Examination in Reading Abilities.¹¹

In diagnosing a student's difficulties in problem solving, the teacher will
wish to note:


1. his test scores in work-type silent reading;

2. his knowledge of arithmetic vocabulary and symbolism (as revealed in sub-
tests of the SRA Achievement Series, the California Achievement Tests, or
other standardized tests);

3. his scores on exercises in problem solving (selecting the process to be used,
estimating answers, and the like), such as those provided in the above-
mentioned tests.

Before undertaking a thorough diagnosis, however, the teacher should 
ascertain the extent to which errors in arithmetic problems may be due 
to errors in computation, carelessness in arranging work, and general lack 
of neatness. For these causal factors, drill in arithmetic combinations plus
improved motivation may be all that is needed. A more thorough diagnosis
is not required unless the student shows evidence of failure to grasp the
quantitative relations involved or generally inefficient methods of problem
solving.


11 John P. Treacy, “The Relationship of Reading Skills to the Ability to Solve
Arithmetic Problems,” Journal of Educational Research, vol. 38 (October 1944),
pp. 86-96.




DIAGNOSING LEARNING DIFFICULTIES IN READING After the students 
who need special help in reading have been identified, diagnostic tests 
should be administered to determine the specific nature of students' diffi-
culties. Results on the Survey section of the Diagnostic Reading Tests,
for example, will help the teacher to decide which of the diagnostic tests 
should be given. For students whose general level of reading comprehen- 
sion is below their ability level, both parts (Silent and Auditory) of Sec- 
tion 2 (Comprehension) should be administered. For students who attain 
low reading-rate scores on the Survey test, more specific diagnostic infor-
mation can be obtained by the administration of Section 3 (Rates of Read-
ing), which provides data on the student's reading rate in two subject-matter
areas, as well as his ability to read rapidly under pressure. Finally, for 
students who have scored low in vocabulary and silent reading as compared 
with auditory comprehension, the teacher may wish to administer indi- 
vidually Section 4, on Word Attack. In the first part of this test, the student 
is asked to read aloud six graded paragraphs of interesting general-type 
reading material. The teacher observes the student's reading attitude and 
methods, and records all errors, using the recommended notations for sub- 
stitutions, omissions, repetitions of two or more words, mispronunciations, 
insertions, and the like. 

The teacher of a remedial-reading class may need to know the words in 
a primary sight vocabulary that a student has not learned or the specific 
skills in word attack that he has not mastered. Under such circumstances, 
individual administration of a word-perception or oral-reading test may be 
indicated. A student's performance in oral reading reveals his mastery of
the basic skills of word recognition and word analysis. When a pupil reads 
orally, the teacher can note his reading habits and his fluency in reading, 
as well as the kinds of errors he makes in pronunciation. 

An oral-reading test is usually administered by having a student read 
each paragraph aloud while the examiner evaluates his performance, 
recording such errors as hesitations, insertions, mispronunciations, omis- 
sions, repetitions, and substitutions. If a student's oral-reading performance 
is recorded on tape, more objective analysis and scoring are possible. In 
some oral-reading tests, the student's comprehension of the material read 
is checked by having him respond orally to standard questions read to him 
by the examiner. On some of the tests, the student's rate of reading is also 
noted. Three illustrative oral-reading tests will be briefly described. 


1. The Gates Diagnostic Reading Tests¹² are individually administered
oral-reading tests. Two forms are available. The series consists of a pupil's
record booklet for the teacher, two sets of cards containing test material for
the child, and a manual that contains directions, grade and age norms, and
suggestions for remedial procedures. From the 18 tests one can choose
those likely to help in analyzing the child's specific difficulties.

12 Arthur I. Gates, Gates Reading Diagnostic Tests (New York: Bureau of
Publications, Teachers College, Columbia University, 1945).

2. The Durrell Analysis of Reading Difficulty¹³ for grades 1 to 6 consists of a
set of eight graded paragraphs to measure oral reading habits, another set 
of eight paragraphs for measuring oral recall, and two additional sets of 
paragraphs of comparable difficulty to be read silently in order to measure 
oral and written recall. The series also contains a total of 175 words to be 
flashed in a cardboard tachistoscope, in order to measure word recognition 
and word analysis. An accompanying record booklet contains several classi- 
fied checklists of reading difficulties. 

3. The Gilmore Oral Reading Test¹⁴ consists of ten oral-reading paragraphs
that form a continuous story. The paragraphs are scaled in difficulty, and
each is accompanied by five questions to check comprehension. The test
is designed to measure accuracy of oral reading, rate of oral reading, and
comprehension in grades 1 through 8.


For suggestions for diagnostic work in handwriting, language usage,
foreign language, and several other subject areas, the student is referred to
the authors’ textbooks for elementary and secondary school teachers.


Discovering the Causal Factors 


In some cases of learning difficulty, the causal factors are relatively
simple. A student may (as the result of inattention, insufficient or
inefficient practice, or irregular attendance) have failed to learn basic
vocabulary or verb forms in foreign language or to understand the process of
multiplying signed numbers in algebra. If such causal factors are temporary,
it is sufficient to identify and remedy the gaps in the student's learnings.
If, however, inattention or any other behavior is a persistent factor, it may
be symptomatic of underlying difficulties. In these situations, attempts must
be made to identify and remedy the basic causes. Until this is done,
corrective instruction will remain at the “patching up” level of effectiveness.

If a student consistently achieves below expectancy in most of his
subjects, certain basic causal factors, such as poor health, faulty work habits,
or emotional maladjustment, may be involved. Although no single factor
may seem to be serious, the combined effect of several factors may produce
significant scholarship and behavior problems.

13 Donald D. Durrell, Durrell Analysis of Reading Difficulty (New York:
Harcourt, Brace & World, Inc., 1937).

14 John V. Gilmore, Gilmore Oral Reading Test (New York: Harcourt, Brace
& World, Inc., 1952).

Although the basic causes of low achievement for a given student are
usually complex and interrelated, most of them can probably be classified
under five major categories:




DISABILITIES IN THE BASIC SKILLS  Retardation in many subjects can 
frequently be attributed to retardation in the basic skills of reading and 
arithmetic. Difficulties in social studies, science, and many other subjects 
may be due to an inability to read textbook materials with comprehension. 
Difficulties in arithmetic reasoning or in the higher processes of arithmetic 
computation are attributable, in part, to the fact that the basic arithmetic 
skills are not functioning at an automatic level. Difficulties in algebra, 
chemistry and physics, industrial arts, and other subjects may be due, in 
part, to deficiencies in arithmetic. 

Corrective instruction in the basic skills is necessary if the students are 
to get adequate returns from their study time in the subjects in which these 
skills are constantly demanded. 


INADEQUATE WORK-STUDY SKILLS Many students fail to do their best 
work because of inadequate methods of attacking learning problems. Many 
subjects place a premium on the student's ability to find information in 
books, to locate related source materials, and to read maps, graphs, and 
tables. The student must know the technical vocabulary of a subject field 
if he is to comprehend the content. 


SCHOLASTIC APTITUDE FACTORS Although some authors list "inadequate
mental maturity" as a major cause of learning difficulties, it is obviously
the disparity between the abilities required in the teaching-learning
situations and the student's mental maturity that is at fault. In other words, 
retardation and discouragement may result from the student's having been 
programmed into subjects that are too difficult for him or from the use of 
instructional materials that are too difficult. 

The modern school has accepted the responsibility of helping each stu- 
dent achieve to the level of his capacity. The limitations of group instruc- 
tion, however, make it desirable that the student's mental maturity not 
deviate too markedly from the average of his class or his instructional 
group within the class. When learning tasks are too difficult, frustration 
destroys interest and incentive; when they are too easy, boredom can lead 
to minimum involvement and inadequate effort. 


PHYSICAL FACTORS Chronic diseases, impairment of vision and hear- 
ing, and other physical handicaps interfere with learning. Lowered vitality, 
distractibility, and irregular attendance hinder learning and decrease re- 
tention. The student's resulting discouragement and lack of interest may 
contribute, in turn, to further retardation. 

The cumulative health records of inefficient learners should be checked
and interpreted with the aid of the school nurse. Such a record not only
indicates a student's health status as of the time of his latest medical
examination but also provides a longitudinal picture of his physical
development. Significant physical characteristics are recorded there as well as
recommendations that have been made to parents and educators concerning
necessary remedial or preventive measures. In schools that use the Wetzel
"grid" (a graphic record form for cumulated height-weight data),
significant "growth failures" (or deviations from a student's expected growth
pattern) can be readily identified. In a study of more than 2000 students,
the Wetzel grid identified 95 percent of students classified as "poor" or
"borderline" by school physicians.¹⁵

Impaired vision is a serious hazard to learning. The Snellen Chart, the
most widely used screening test, fails to detect moderate farsightedness or
astigmatism, as well as difficulties in eye coordination; in fact, in one study,
use of the Snellen Chart detected eye defects in only about half of the
students who showed defects in a thorough ophthalmological examination.¹⁶
Hence, many school systems use a telebinocular to measure eye-muscle
balance, depth perception, visual fusion, and other abilities important to
effective reading; even these tests, however, are intended only to identify
students who should be referred to ophthalmologists for further study.

Defective hearing is a serious hazard to learning. The widespread use 
of audiometers for testing students’ hearing has greatly increased our effec- 
tiveness in early identification of hearing loss. As many as 40 students can 
be tested at a time; students who do poorly on a group audiometer test 
can be tested individually with an instrument that checks acuity of hearing
at different pitches.


EMOTIONAL FACTORS The emotional tensions of the poorly adjusted
student may affect his concentration, motivation, and persistence of effort.
A student's fear of failure may almost paralyze his efforts; his
self-consciousness may cause him to withdraw from participation in class
activities; as a result of his hostility toward adult authority, he may refuse
to do assigned tasks; his anxiety to do things well may prevent him from
developing reasonable speed in reading, handwriting, and other skills;
continued frustrations may result in a retreat to daydreaming and psychological
deafness to the teacher’s voice. At adolescence, preoccupation with social
status and boy-girl relationships may cause a drop in achievement. Methods
of studying emotional factors have been considered in Chapters 8 and 9.


15 Norman C. Wetzel, “The Simultaneous Screening and Assessment of School
Children,” Health and Physical Education, vol. 10 (December 1942), pp. 576-577.

16 T. H. Eames, “The Effect of Correction of Refractive Errors on the Distant
and Near Vision of School Children,” Journal of Educational Research, vol. 36
(December 1942), pp. 272-279.




GROUP DIAGNOSIS 


The use of standardized tests in group diagnosis was discussed in Chap- 
ter 13. Almost any measurement procedure used by teachers can be used 
in diagnosis. For example, an algebra teacher introduced the principles 
involved in the addition, subtraction, multiplication, and division of signed 
numbers, devoted a week to teaching them, and then gave a ten-item test 
covering what had been taught. Although this test was much too short for 
individual diagnosis, the results could aid in group diagnosis. By use of 
the show-of-hands method (discussed in Chapter 10), the teacher tabu- 
lated the errors for each item, as follows: 


ITEM    ERRORS
  1        2
  2        1
  3        3
  4        2
  5        3
  6       10
  7        8
  8        9
  9       32
 10       32


From this simple tabulation, the teacher was able to see that most of the 
difficulty lay in the last five examples. Since examples 9 and 10 had been
missed by everyone, it was obvious that the concept they involved
(subtraction of one negative number from another) needed to be retaught to
the entire class. The errors on examples 6, 7, and 8 involved the
multiplication or the division of two negative numbers. The teacher could ask
students who missed these problems to come to the chalkboard during a 
supervised study period; he could then give further explanation, answer 
their questions, and require them to work additional problems of this 
type. 

A simple group diagnostic analysis of this sort allows the teacher to
locate quickly the specific items on which many of the students are having 
trouble. These items can then be studied to identify what is causing trouble. 
Review and reteaching can then be focused on the learning needs of the 
students. 
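
The tabulation for the algebra test lends itself to a small worked example. The following Python sketch is illustrative only: the class size of 32 is inferred from the statement that examples 9 and 10 were missed by everyone, and the two cutoff proportions and the function name `classify_items` are hypothetical choices, not part of the procedure described in the text.

```python
# Errors per item, copied from the algebra teacher's tabulation.
errors_by_item = {1: 2, 2: 1, 3: 3, 4: 2, 5: 3, 6: 10, 7: 8, 8: 9, 9: 32, 10: 32}

# Assumption: 32 students, since items 9 and 10 were missed by everyone.
CLASS_SIZE = 32

# Hypothetical cutoffs, chosen only for illustration.
RETEACH_CLASS = 0.75  # reteach to the whole class at or above this error rate
SMALL_GROUP = 0.20    # give small-group help at or above this error rate


def classify_items(errors, class_size):
    """Return {item: (error_rate, action)} for each test item."""
    result = {}
    for item, n_errors in sorted(errors.items()):
        rate = n_errors / class_size
        if rate >= RETEACH_CLASS:
            action = "reteach to class"
        elif rate >= SMALL_GROUP:
            action = "small-group help"
        else:
            action = "no group action"
        result[item] = (rate, action)
    return result


plan = classify_items(errors_by_item, CLASS_SIZE)
for item, (rate, action) in plan.items():
    print(f"item {item:2d}: {rate:5.0%} missed -> {action}")
```

With these assumed cutoffs, items 9 and 10 are flagged for reteaching to the entire class and items 6, 7, and 8 for small-group help, which agrees with the teacher's own reading of the tabulation.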

If a test is designed for machine scoring and an item-count attachment
is available on the local IBM test-scoring equipment, a report such as the
one shown in Figure 14.4 can be easily obtained. This chart summarizes
student successes on items 1 to 90 of a departmental history examination.
Because the chart is prepared by counting student responses electronically,
an analysis can be made of a hundred papers in ten minutes. If the test has
more than 90 questions, the process must be repeated for each group of 90
or fewer items.¹⁷

[Figure 14.4: International Test Scoring Machine, Graphic Item Count Record]

Fig. 14.4 Item Count of the Number of Correct Responses of 100 Students to
Each of 90 Items of a History Test.
Reproduced by permission of the International Business Machines Corporation.


17 If the teacher wishes to have separate analyses for the high-scoring or
low-scoring halves of his group (or the high-scoring and low-scoring fourths), he need
only group the papers and request that the scoring machine operator record results
for each group on a separate graphic item count record. Then, item analysis results
are easily transferred to the test itself, or to test-item cards in the manner suggested
in Chapter 10.




It is evident from Figure 14.4 that several questions were so easy that 
they had little measurement value. Others were missed so frequently that 
they may indicate a need for reteaching the concepts involved. (The possi- 
bility that a question is difficult because of ambiguity should always be 
considered.) If a teacher is using a standardized or departmental test, he 
may wish to go through the test before the item count is made and mark 
the questions that he feels have curricular validity for his instructional 
program. Then, from the results of the item analysis, he can see how well 
his students succeeded on those test questions that he considers to be valid 
and significant. 

In the manuals for many standardized achievement tests,¹⁸ data are
given regarding the percentage of students in the representative norming 
sample who answered each item correctly. The teacher can compare the 
results of his own item analysis with these data in order to determine for 
each item the relative standing of his class or classes with respect to these 
percentage-correct norms. This comparison has added significance if the 
teacher has first evaluated all items of the standardized test in terms of their 
curricular validity for his own instructional program. 
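
The comparison with percentage-correct norms can be sketched in a few lines of Python. Everything in this example is assumed for illustration: the item numbers, the class counts, the norm percentages, the set of items judged curricularly valid, and the function name `compare_with_norms` are all hypothetical and do not come from any test manual.

```python
def compare_with_norms(class_correct, class_size, norm_pct, valid_items=None):
    """List (item, class %, norm %, difference) for each item.

    If valid_items is given, only items the teacher has marked as having
    curricular validity for his program are reported.
    """
    rows = []
    for item in sorted(class_correct):
        if valid_items is not None and item not in valid_items:
            continue
        class_pct = 100.0 * class_correct[item] / class_size
        rows.append((item, class_pct, norm_pct[item], class_pct - norm_pct[item]))
    return rows


# Hypothetical data: number of correct answers per item in a class of 40,
# and the percentage correct reported for the norming sample.
class_correct = {1: 36, 2: 20, 3: 10, 4: 30}
norm_pct = {1: 85.0, 2: 60.0, 3: 30.0, 4: 70.0}
valid_items = {1, 2, 4}  # items the teacher marked before the item count

rows = compare_with_norms(class_correct, 40, norm_pct, valid_items)
for item, class_pct, norm, diff in rows:
    print(f"item {item}: class {class_pct:.0f}%, norm {norm:.0f}%, difference {diff:+.0f}")
```

A negative difference marks an item on which the class stands below the norming sample. Restricting the report to the marked items keeps the comparison focused on content the teacher has actually taught.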


BASIC PRINCIPLES OF CORRECTIVE INSTRUCTION 


Three essential steps in educational diagnosis have been explored: (1) 
identifying the students who are having trouble, (2) locating the errors or 
learning difficulties, and (3) discovering the causal factors. The results of
such diagnosis have significance only if they constitute the basis for correc- 
tive instruction and for remedial procedures that remove, alleviate, or 
compensate for causal factors in the student and his environment. 

It is not within the scope of this book to outline corrective procedures 
for each of a wide variety of learning difficulties and causal factors. In 
fact, the learning process and the dynamics of human motivation are so 
complex that it is undesirable, if not impossible, to match corrective
procedures to learning problems in any rote or mechanical fashion.¹⁹


18 For example, the tests of the Evaluation and Adjustment Series (Harcourt,
Brace & World, Inc.), and the STEP tests (Educational Testing Service).

19 General suggestions, and selected lists of remedial materials in several subject
fields are given in Chapters 11 through 14 of the authors’ textbook for elementary
school teachers and Chapters 13 through 16 of the authors’ textbook for secondary
school teachers. T. L. Torgerson and Georgia Sachs Adams, Measurement and
Evaluation for Elementary School Teachers (New York: Holt, Rinehart and
Winston, Inc., 1954); Georgia Sachs Adams and T. L. Torgerson, Measurement and
Evaluation for Secondary School Teachers (New York: Holt, Rinehart and Winston,
Inc., 1956).




An important distinction should be made between corrective instruction 
in information and understandings and corrective instruction in the basic 
skills. If a teacher can identify several students who lack a thorough un- 
derstanding of certain concepts (for example, latitude and longitude), he 
may reteach these concepts through group instruction, demonstrations, and 
supplementary reading by the students. General retardation in the content 
subjects, however, is frequently due to inadequate mastery of the basic
skills of reading, arithmetic, or language, or to inadequate command of the 
work-study skills. Hence, corrective work in the basic skills plus improved 
motivation in the content subjects may be sufficient to effect improvement. 

Deficiencies in the basic skills of reading, arithmetic, and language can 
be corrected only in part by special group instruction or by individualized 
assistance during supervised study. In these fields, the best results can be 
achieved only by systematic, meaningful practice on instructional materials 
designed to develop the specific skills that the individual has failed to
master.


Selection of Materials 


Selection of corrective materials for a student is a crucial aspect of his
corrective instruction. Any materials selected should meet the following
criteria:


1. The difficulty of the materials should be geared to the student's
readiness or maturity in the subject or skill to be improved. If the student's
interest is to be maintained, corrective instruction must result in feelings of
accomplishment. The student's grade equivalent on a standardized test may
be used as a partial basis for selecting the level of instructional materials
to be used. A set of remedial materials without grade labels, which provides
for a wide range of difficulty, should be used.

2. The corrective materials should be designed to correct the student's
individual difficulties. By means of observation, [...] materials, the teacher
will have analyzed the work of the [...] student in order to locate his
specific retraining needs. Corrective material, designed to correct the
specific difficulties discovered, should be provided.

3. The corrective materials should be largely self-directive. An individualized
instructional program cannot achieve optimum effectiveness unless the
materials are self-directive, permitting a number of students to work
independently on different materials. Written directions, easily read and
understood by the students, must accompany the materials, so that a minimum of
direction by the teacher is required.

4. The corrective materials must permit individual rates of progress.

5. A method should be provided for recording evidence of individual progress.
When the student has an opportunity to record his successes on a progress
record, he is given an additional incentive to achieve.




The reader will recognize that well-designed programmed instruction 
meets these criteria. 


Planning and Carrying Out the Program 


Although the selection of remedial materials is highly important, it is 
only one aspect of the teacher's attack upon learning difficulties and under- 
lying causative factors. The following principles should guide the teacher 
in planning and carrying out the program: 


1. One of the first steps should be the correction of any physical factors that 
affect learning, such as defects of hearing or vision. 

2. The cooperation of the parents should be obtained in correcting such
physical factors, alleviating emotional tensions, providing better study conditions,
and the like. 

3. If the student seems to have little desire to learn, immediate steps should be 
taken to try to improve his attitude through providing activities in which he 
can enjoy success, receive praise for his efforts, and be given opportunities 
to develop his special interests or use his special skills. Personal interviews 
may do much to establish rapport and provide leads that will help the 
teacher know each student's interests and problems. 

4. Corrective instruction should begin by analyzing with the student his specific 
strengths and needs and showing how the instructional materials are de- 
signed to correct his deficiencies. When the student is helped to face his 
problems constructively and provided with aids to solving them, he can 
usually take the first steps that lead to early evidence of progress. 

5. Instruction should begin at, or slightly below, the learner's present level of 
achievement. Short-term goals should be established which the learner con- 
siders reasonable and possible of attainment. By means of progress charts, 
praise, and social recognition, the student's feeling of successful accomplish- 
ment should be reinforced. 

6. Since corrective instruction must usually proceed on the basis of a tentative 
diagnosis, the teacher must be ready to modify the remedial program if the 
approach and materials selected seem to be ineffective. 

7. The results of corrective instruction should be evaluated—that is, compara- 
ble forms of a standardized test should be administered before and after a 
period of concentrated instruction. The effectiveness of the program must 
be evaluated for each student rather than merely in terms of class averages. 

8. A record should be made of the results of each student's diagnosis, of 
methods and materials used, and of the results of corrective instruction. 
Such a record is not only helpful in the determination of next steps; it is 
likely to be invaluable to the next teacher if the student continues corrective 
instruction. 
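
Principle 7 above, evaluating each student rather than the class average alone, can be illustrated with a short Python sketch. The student labels and the before-and-after scores are invented for the example, and treating any gain of zero or less as a signal for review is an assumed rule, not one stated in the text.

```python
from statistics import mean

# Hypothetical scores on comparable forms of a standardized test,
# given before and after a period of corrective instruction.
pre = {"A": 32, "B": 40, "C": 28, "D": 35}
post = {"A": 44, "B": 51, "C": 27, "D": 46}

# Gain for each student, computed individually.
gains = {student: post[student] - pre[student] for student in pre}

# The class average alone can hide a student who made no progress.
average_gain = mean(gains.values())

# Assumed rule for illustration: review the program for any student
# whose score did not rise at all.
needs_review = sorted(s for s, gain in gains.items() if gain <= 0)

print("gains by student:", gains)
print("class average gain:", average_gain)
print("students needing a revised program:", needs_review)
```

In this invented case the average gain is comfortably positive, yet student C shows no gain at all, which is the kind of result that principle 6 asks the teacher to act upon by modifying the remedial program.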




In individualized instruction, the teacher is constantly reminded of a
principle that he frequently overlooks in other teaching situations: that
learning rather than teaching is the goal of his activities. As emphasized
earlier, the growth of each individual, rather than the change in
group averages, is the criterion of success. Hence, the teacher needs a rich
background in psychology and educational diagnosis, as well as consultant
help from specialists, in order to attack successfully the variety of
individual problems that present themselves.


SUMMARY STATEMENT 


The first step in diagnosis is to identify the students who require further 
study. Low achievement may result chiefly from limited learning ability. It is 
therefore desirable to compare a student's achievement with some measure of 
expected achievement or with the distribution of test scores for students of 
his mental maturity level, as shown on an expectancy chart. Low achieve- 
ment calls for detailed diagnosis only if it is significantly below expectancy for 
the student. 

The second step is to locate and study the specific errors or difficulties, using 
tests that are valid and reliable for this purpose. Testing can be followed by a 
diagnostic interview to discover how and why the errors were made. Teacher- 
made tests can also be used as diagnostic instruments, provided the teacher 
constructs tests that provide many opportunities for students to make crucial
errors and takes the time to analyze the specific mistakes made by the students. 


The third step is to discover the causal factors. Causes of poor learning can
be grouped under five general headings: disabilities in the basic skills,
inadequate work-study skills, scholastic aptitude factors, physical factors, and
emotional factors.

Corrective instruction in the skills should be individualized, permitting each
student to work independently at his own rate on materials that have been
selected or designed to correct his specific deficiencies. Appropriate materials
for individualized instruction should provide for self-direction and be of the
proper difficulty to ensure successful performance. Motivation should be
improved through providing success experiences, praising efforts, and developing
good rapport. The results of corrective instruction should be evaluated in terms
of growth for each student, and a record should be made of methods and
materials used and gains effected.


SELECTED REFERENCES 


BLIESMER, EMERY P., “Methods of Evaluating Progress of Retarded Readers in
Remedial Reading Programs.” 15th Yearbook, National Council on
Measurements Used in Education. New York: The Council, 1958, pp. 128-134.

BRUECKNER, LEO J., “Diagnosis in Teaching,” in C. W. Harris, ed.,
Encyclopedia of Educational Research. New York: The Macmillan Company,
1960.

-----, AND GUY L. BOND, The Diagnosis and Treatment of Learning Difficulties.
New York: Appleton-Century-Crofts, 1955.

KIRK, BARBARA A., “Test Versus Academic Performance in Malfunctioning
Students,” Journal of Consulting Psychology, vol. 16 (June 1952), pp.
213-216.




KOUGH, JACK, AND ROBERT F. DEHAAN, Helping Children With Special Needs.
Teacher's Guidance Handbook, Part II, Elementary Edition. Chicago:
Science Research Associates, Inc., 1956.

-----, AND -----, Identifying Children Who Need Help. Teacher's Guidance
Handbook, Part I, Elementary Edition. Chicago: Science Research
Associates, Inc., 1955.

LINDQUIST, E. F., ed., Educational Measurement. Washington, D.C.: American 
Council on Education, 1951, Chapter 2. 

TRAXLER, ARTHUR E., "The Use of Tests in Differentiated Instruction," Educa- 
tion, vol. 74 (January 1954), pp. 272-278. 

TRIGGS, FRANCES O., AND OTHERS, Diagnostic Reading Tests: Their Interpreta- 
tion and Use in the Teaching of Reading. New York: Committee on Diag- 
nostic Reading Tests, Inc., 1952. 

WOOD, ERNEST R., "Subject Disabilities: Special Difficulties in School Learning," 
in Charles E. Skinner, ed., Educational Psychology, 3d ed. Englewood 
Cliffs, N.J.: Prentice-Hall, Inc., 1951, pp. 484-521.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Consider the tenets discussed in this chapter, and identify those that were 
followed in the schools you attended as a pupil. 

2. Compare survey and diagnostic tests with respect to the purposes they 
serve. Cite a specific test of each type in one subject field. 

3. What additional data would you like to have concerning Alfred and 
Winifred (underachievers in Fig. 14.2) in order to develop hypotheses as to 
why they are not working up to ability level? 

4. Analyze and evaluate a diagnostic test in silent reading (or arithmetic) 
and indicate whether differences between subtest scores can be interpreted with 
confidence. 

5. What aspects of reading disability may be revealed through the use of an 
oral reading test? 

6. Select a diagnostic test in arithmetic and describe the skills and abilities 
that it is purported to measure. 

7. Evaluate several specimens of handwriting on the basis of a diagnostic 
handwriting scale. 


PART FOUR


Administrative, Supervisory, and Guidance Aspects


15 Planning and Administering the Evaluation Program


In Part Three, major emphasis was placed on the use of tests and other
evaluation techniques by the classroom teacher. Since teachers have
considerable freedom in the planning of educational experiences for their
students, they must inevitably assume greater responsibility for evaluating the
worthwhileness of those experiences. Moreover, to the extent that schools
are committed to the ideal of meeting student needs and individualizing
instruction in terms of those needs, the teacher becomes the key person in
interpreting measurement data and in putting such data to use in
individualizing instruction and in guiding students.

Certain aspects of the modern evaluation program, however, are
school-wide and system-wide in scope. For example, the objectives that the teacher
develops as a guide to his own evaluation activities should be in harmony
with the common objectives of the educational program, as cooperatively
developed on a school-wide or system-wide basis. Decisions must be made
on at least a school-wide basis concerning the types of measurement data
that should be obtained and recorded for all students. Agreement should
be reached, also, on the most valid, reliable, and practicable instruments
to be used for obtaining such data. In the selection and development of
diagnostic tests, as well as a variety of teacher-made tests, rating scales,
and other evaluation instruments, it is desirable for groups of teachers to
coordinate their efforts. In-service education is needed in the
administration, scoring, and interpretation of tests, as well as in the use of the more
informal evaluation techniques, in order that the results will be valid and
reliable.

In this chapter we are concerned primarily with the school evaluation
program as it involves (1) the cooperative development and clarification
of school objectives; (2) the administration and use of standardized tests
in obtaining certain data for all students; and (3) the cooperative selection
and development of supplementary evaluation materials to be available to
all teachers as needed.


FUNCTIONS OF THE EVALUATION PROGRAM 


Although the distinctions among the functions are by no means clear-cut, 
data provided by an evaluation program can be used to implement certain 
(1) administrative-supervisory, (2) instructional, and (3) guidance 
functions. 


Administration and Supervision 


From the administrative-supervisory point of view, measurement data 
for class and grade groups can assist in determining needs for supervisory 
assistance or for a revision of curricular activities and instructional
materials. Measurement data are indispensable in evaluating student progress
toward educational goals and reporting on such progress to the school
board and the community. The measurement program also affords data
needed for the maintenance of the cumulative records essential in the
transfer and promotion of students. Measurement data aid in the making
of administrative decisions concerning the classification of students,
enabling the administrator to reduce the heterogeneity of classes or to identify
students to be assigned to special remedial groups. Tests are indispensable
aids in the selection of students by college admissions officers.


Instruction 


Evaluation serves four major instructional functions: (1) to determine 
the extent to which students are making progress toward instructional 
goals; (2) to provide evidence to students and their parents concerning 
such progress; (3) to ascertain group and individual retraining needs as a
basis for grouping within the class, and individualized corrective
instruction; and (4) to aid in developing hypotheses concerning basic causal
factors behind any deficiencies. In carrying out the first and second
functions, the teacher is concerned with measuring both short-term gains as
a result of instruction, and long-term gains over a period [...]


Guidance 


The guidance functions of evaluation are closely interrelated with the
instructional functions. They are chiefly concerned with the use of
evaluation data in (1) helping the counselor to assist students in making wise
decisions in the areas of educational and vocational guidance; (2) helping
counselors and teachers to understand the students’ needs in the area of
personal-social adjustment; (3) helping students toward improved
self-appraisal and self-direction; (4) increasing the parent's understanding of
the special strengths and needs of his son or daughter as relevant to needed
decisions; and (5) assisting in the identification of students whose special
abilities or disabilities require unusual modifications in the educational
program or referral to specialists.


CHARACTERISTICS OF AN EFFECTIVE 
EVALUATION PROGRAM 


The functions of evaluation listed above cannot be achieved by an
uncoordinated, hit-or-miss program or by a testing program limited to the
measurement of the basic skills. In order to serve all functions effectively,
an evaluation program should meet the following criteria:


1. It should be based on a realistic statement of educational objectives. The
educational program is designed to bring about certain changes in student 
behavior—to teach students to do certain things better, to better understand 
concepts and principles, to make certain choices more wisely, than if
they had not had the educational experiences provided. The evaluation 
program should provide evidence on the extent to which these changes in 
student behavior have been achieved. 

2. Evaluation should be comprehensive. Plans should be made for evaluating
student growth toward each major objective. Important goals should not be
omitted in evaluation just because progress toward them cannot be appraised
with high objectivity. "The importance of the outcome, rather than the
precision with which its attainment can be measured, governs the effort to
be devoted to obtaining and appraising the evidence."¹

3. The analysis of educational objectives and the subsequent planning of the
evaluation program should be a cooperative undertaking. All teachers should
participate in the group thinking required to establish and analyze goals
and in the selection or development of instruments used in evaluation. At
the classroom level, students should participate in analyzing objectives as
they see them, in setting up criteria for evaluating achievement in specific
activities, and in appraising individual and group products. Parents should
also participate in planning those aspects of the evaluation program that
are concerned with the reporting of progress to the home.

4. Evaluation should be a continuous process, providing a longitudinal picture
of each student's growth rather than an occasional cross-sectional survey of
current status. That is, evaluation should be used in assessing through pretests
students’ previous learnings, in appraising student growth in short units
of instruction, in revealing to students the specific inadequacies in their
achievement, in accumulating for the teacher the information necessary to
plan next steps in instruction, in providing an objective basis for reporting
to students and parents, and in serving many other functions that are integral
parts of the instructional process.

¹ Warren G. Findley, "Gain vs. Status Scores as Evidence of Effectiveness of
Instruction," 19th Yearbook, National Council on Measurement in Education
(Ames, Iowa: The Council, 1962), p. 45.

5. The evaluation program should be flexible. Certain tests can be used to 
advantage on a school-wide basis and at scheduled times. Still other evalua- 
tion instruments should be available to teachers to be used whenever they 
contribute to the instructional program. Many instruments, such as student 
self-rating scales, will be developed in the classroom by teachers and stu- 
dents working and thinking together. Diagnostic tests and remedial materials 
should be available for use with individual students as needed. 

6. All evaluation activities should be keyed closely to the local situation. Al- 
though many published tests will be usable, local norms and expectancy 
tables should be developed. Some instruments will have to be developed 
locally. Unless the content of a published achievement test is closely related 
to the objectives emphasized in the local curriculum, it should not be used 
to measure outcomes just because it has prestige or is readily available. 

7. The functions of the evaluation program should not be limited to an ap- 
praisal of group progress. Most of the tests used in an evaluation program 
should have sufficient validity and reliability to be useful in guiding
individual students and appraising their progress.

8. The evaluation program should utilize a variety of techniques. Evaluation 
is not limited to the use of paper-and-pencil tests; it includes all means of 
obtaining valid data concerning desired changes in student behavior. 


PLANNING THE EVALUATION PROGRAM 


If school evaluation programs are to meet the criteria listed above, much 
cooperative effort is needed. In the typical school system, practice lags far 
behind accepted theory. Measurement, evaluation, diagnostic study, and
corrective instruction should not be considered supplementary activities 
to be performed if teachers find time to do so. They are essential aspects 
of effective teaching, which facilitate learning and should become an in- 
tegral part of the instructional activities of every teacher. 

The planning of a functional evaluation program must be based upon 
cooperative decisions reached within the school on (1) what types of judg- 
ments and decisions need to be made about students and by students and 
(2) the general and specific objectives of the educational program. The 
answers to the first question will help in the selection of tests to be used
in making institutional decisions and helping students with their educational
and vocational choices. Answers to the second question will aid in planning
an evaluation program that will help to assess student progress and
evaluate the effectiveness of the instructional program.




The necessary steps in planning an evaluation of the instructional pro- 
gram include: (1) the cooperative development of a realistic list of ob- 
jectives for the educational program; (2) the analysis of these objectives 
in terms of changes in student behavior; (3) exploration of possible tech- 
niques and instruments with which to obtain evidence concerning changes 
in student behavior; and (4) the selection and development of the actual 
instruments to be used. 


Cooperative Development of Objectives 


Although many excellent lists of objectives are available, they tend to 
be so generalized that they provide little guidance in the selection of cur- 
ricular activities and the development of evaluation instruments. The 
principal of each school, with the aid of subject supervisors, should lead 
his teachers in the cooperative development of a realistic list of intermedi- 
ate objectives in each subject area that are related to the ultimate objectives 
of the educational program.

The process of group thinking that a teaching staff goes through in stating
realistic objectives for its own students is an exceedingly valuable one.
Goals that have seemed to be only attractive slogans take on real meaning,
and their implications for both the content and the method of daily classroom
activities are recognized. Wrinkle suggests six basic criteria for
judging the value of each objective in a locally developed list of goals.
"Is the objective (1) understandable, (2) stated as a behavior, (3) based
upon the needs of the learner, (4) socially desirable, (5) achievable, and
(6) measurable?"²

Teachers who are working cooperatively on a list of objectives for their
subject area will want to clarify the meaning of each objective by stating
it in terms of changes in student behavior. For examples of objectives
stated in behavioral terms, the reader is referred to Table 11.1.

After objectives have been stated in behavioral terms, they should be
organized under a few major headings. The grouping of interrelated objectives
in this way assists in the planning of an evaluation program. For
example, if all the objectives involving interests are grouped together, it
may be possible to develop an interest inventory that will obtain evidence
on a number of related objectives. Similarly, if the objectives relating to
study skills are grouped together, a test of study skills can be selected or
developed that will measure student knowledge and skill in a number of
these objectives.

² William L. Wrinkle, Improving Marking and Reporting Practices (New York:
Holt, Rinehart and Winston, Inc., 1947), p. 97.


Survey of Techniques and Instruments 


Since effective measuring instruments are difficult to construct, it is ad- 
visable to use published test materials whenever they provide valid data 
for the purpose and group for which they are being used. In measuring 
the scholastic aptitude of students, published intelligence tests are indis- 
pensable. Published achievement tests of the survey type are necessary to 
determine the present status of individuals or groups or to measure their 
progress over a period of a year or more. Published diagnostic tests aid 
the teacher in identifying students’ specific retraining needs, although the
use of observation and interview often increases the validity of diagnosis. 

In most subjects, however, teacher-made tests must be relied on for 
measurement of short-term growth. In such tests, most questions relate 
to the learnings recently emphasized in the local instructional program. 
Moreover, for many significant objectives of instruction, no published tests 
are now available. For some of these objectives, the teacher may always 
have to rely on such informal and relatively subjective means as observa- 
tion, interview, and self-rating techniques. Hence, his conclusions must be 
tentative unless data obtained by independent methods tend to converge 
or agree in supporting a generalization or conclusion. 

The staff should develop a source list of possible techniques and instru- 
ments for each of the major objectives. As an illustration, such a list has 
been prepared for two major objectives. 


OBJECTIVE           POSSIBLE TECHNIQUES AND INSTRUMENTS

The student uses    California Study Methods Survey
effective study     California Tests in Social and Related Sciences, Elementary,
skills                subtests on "Reading of Maps" and "Knowledge of Geographic
                      Terms"
                    Iowa Tests of Basic Skills, Test W, Work-Study Skills
                    Peabody Library Information Test
                    Spitzer Study Skills Test
                    Survey of Study Habits and Attitudes (Brown-Holtzman)
                    Tyler-Kimber Study Skills Test
                    Wrenn Study Habits Inventory
                    Informal teacher-made exercises on the interpretation of
                      maps, graphs, and tables
                    Teacher-made tests of study skills
                    Teachers’ checklists for guided observation of students’
                      work on a specific library assignment
                    Instructional tests issued by publishers on the use of the
                      dictionary, encyclopedia, and other reference books
                    Self-rating checklists on study habits




OBJECTIVE           POSSIBLE TECHNIQUES AND INSTRUMENTS

The student         Anecdotal records
cooperates          Systematic observation of students, using code numbers for
effectively in        notations on significant behaviors
class activities    Observation of students’ attitudes revealed in role-playing
                      and dramatic play
                    Recording evidence of students' participation and
                      contributions in cooperative group enterprises, such as
                      bringing relevant material to class
                    Lists of criteria (developed by teacher and students) on the
                      characteristics of a good class discussion and other
                      skills of group work
                    Self-rating charts based on such criteria
                    Having students write out the endings for reaction stories
                      concerned with cooperation in a group
                    Informal teacher-made tests on knowledge of the skills of
                      group work (for example, functions of the chairman and
                      secretary of a committee, parliamentary procedures, and
                      the like)
                    Published tests, such as the Behavior Preference Record, or
                      the subtest on "Understanding of Democracy" from the
                      California Tests in Social and Related Sciences
                    Published or teacher-made attitude scales


In the process of preparing such source lists, one should explore many
possible techniques and instruments for evaluating growth toward each
major goal of the educational program.

A school staff that has carried out the steps in planning outlined above 
has already laid the groundwork for the selection and development of 
evaluation instruments. As teachers have thought through in detail the 
meaning of each objective, they have partially developed bases for select- 
ing the published instruments that meet their needs; they have partially 
formulated many criteria for evaluating learning products; they have developed
the "raw material" for observation guides and rating scales on
sportsmanship, cooperation, responsibility, and similar objectives.


Selecting Standardized Tests 


Selecting valid tests is a major problem for the school. There are large 
numbers of published standardized tests from which to select but a rela- 
tively small number of usable tests with high validity for a specific purpose. 
It is generally desirable to use the same intelligence and achievement test
batteries over a period of years so that comparable longitudinal data can
be obtained and so that teachers will achieve familiarity with each test’s
values and limitations. For this reason, great care in the selection of such
tests pays dividends over a period of time. 

Before a standardized test is adopted for school-wide or system-wide 
use, it may be advisable to try it with representative classes and analyze 
the results. Such an analysis may show that the test is too easy or too 
difficult, or may disclose problems with respect to adequacy of instructions, 
time required in scoring, or other factors that would affect the test’s
usability. 

The selection of tests can never be a routine or entirely objective pro- 
cedure. Professional judgment is needed in deciding the relative importance 
of certain features of a test to be used for a specific purpose. The basic 
concepts involved in the appraisal of tests were presented in Part One and 
summarized in Chapter 5. 


GUIDANCE WORKERS AND PSYCHOLOGISTS AS 
RESOURCE PERSONS 


Principals and teachers often feel overwhelmed by the many decisions that 
have to be made in setting up a good evaluation program or in improving 
one that has been only partially effective. Fortunately, an increasingly large 
number of schools can request assistance from persons with special com- 
petence in measurement and evaluation. Large school districts often have 
a specialist in measurement, employed in a department of research and 
guidance. In smaller school districts assistance may be available from the 
county school department or from a nearby college. 

Guidance specialists (such as directors of guidance or counselors) have 
usually had considerably more training in child-study and evaluation tech- 
niques than most teachers and supervisors. If a guidance specialist is 
assigned time for this purpose, he can be very helpful in the processes 
of selecting tests, planning the testing schedules, instructing teachers in the 
many details of administration and scoring, and planning and supervising 
an efficient system of record-keeping. 

The functions of school psychologists vary considerably from one school 
system to another. Like the guidance workers, the school psychologists are 
generally well versed in child study and evaluation and can serve as con- 
sultants in all phases of an evaluation program. Their special contribution 
is usually in the individual testing of students who deviate far from the 
average in intelligence, personality, achievement, or conduct. 

The administrators, supervisors, and teachers in a system that employs 
competent guidance workers and psychologists are fortunate in being able 
to call upon them for help in many aspects of the evaluation program.
Without specialized personnel, the problems of evaluation are more difficult,
but they can be solved by guidance-minded administrators and teachers
with released time and special training.


PLANNING THE TESTING PROGRAM 


The responsibilities of all school personnel (teachers, administrators,
supervisors, and guidance workers) must be kept in mind in the planning of
the testing program and the selection of tests.


Aptitude Testing 


Ideally, group intelligence tests should be administered in alternate years 
beginning with grade 2 or 3 and continuing throughout the elementary 
school years. In the first grade, a reading-readiness test seems preferable
to an intelligence test. Not only is a reading-readiness test valuable in
helping assign children to reading groups but teachers are less likely to
make premature generalizations about the general learning ability of young
children than if an intelligence test is given.

For the typical school, tests of general mental ability, heavily loaded
with verbal and numerical content, would seem to provide the best dividend
for a given investment of testing and scoring time. Such tests provide
a better basis for assessing expected achievement than do tests of equal
length that include spatial and perceptual factors less closely related to
achievement in school.

For students who are markedly retarded in reading, however, verbal
intelligence tests may grossly underestimate the student’s potential for
school achievement, once his reading handicap has been removed. For such
students, the use of a supplementary intelligence test that does not depend
unduly on reading skill is necessary. Schools that enroll a large percentage
of children who are bilingual or come from culturally deprived areas may
prefer to give nonverbal, as well as verbal, tests to all pupils.

At the junior high school level, ability tests that provide separate scores
in verbal and numerical abilities and possibly other factors of mental ability
are valuable in helping students make decisions concerning elective
subjects and tentative vocational choices.

If the testing budget permits, and qualified staff members can provide
leadership to students and teachers in a self-appraisal program, the
administration of a multiscore intelligence test or an aptitude test battery is
desirable in the ninth or tenth grade. A school staff, however, should not
embark on such a program unless they are willing to allow enough testing
time to obtain reliable scores on each aptitude. A two-factor test, giving
scores on verbal and numerical aptitudes only, is preferable to a battery
giving scores on five or more abilities if the subtests of the battery are too
short to provide reliable scores.
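
The cost of dividing a fixed testing time among many short subtests can be
illustrated with the Spearman-Brown formula, which estimates the reliability
of a test whose length is changed by a given factor. The sketch below is an
illustration only: the full-length reliability of .90 is an assumed figure,
not a value taken from any published battery, and the formula assumes that
each shortened subtest is made up of items comparable to those of the
full-length test.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Estimate reliability after the test length is multiplied by length_factor.

    A length_factor below 1 represents a shortened test.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Assumed figure for illustration: reliability of one test that uses
# the whole testing period.
full_length_reliability = 0.90

# Divide the same testing time equally among 2 or 5 subtests.
for n_subtests in (2, 5):
    estimate = spearman_brown(full_length_reliability, 1 / n_subtests)
    print(f"{n_subtests} subtests: estimated reliability of each = {estimate:.2f}")
```

Under these assumptions, each of two subtests would have an estimated
reliability of about .82, while each of five subtests would fall to about .64.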

Prognostic tests in algebra, geometry, foreign language, and shorthand 
might be included to advantage in some testing programs. Adequate ex- 
pectancy charts can usually be constructed on the basis of intelligence test 
data and previous grades in relevant subjects; however, prognostic tests 
may be preferred because of the greater ease with which special aptitudes,
rather than general scholastic aptitude, can be discussed with students and 
parents. 


Achievement Testing 


A minimal achievement testing program in the primary grades (grades 
1—3) would involve the testing of reading skills in grade 2 and the testing 
of reading, arithmetic, and language in grade 3. In the upper elementary 
grades, annual administration of a test battery in the basic skills is rec- 
ommended. The use of separate answer sheets with upper grade students 
may make possible the inclusion of tests in work-study skills, social studies, 
and science at the fifth- and sixth-grade levels with only moderate addi- 
tional cost. 

At the secondary school level achievement testing is of two types: (1) 
the school-wide testing program in the basic skills, and in the skills and 
content learnings of required general-education courses; and (2) the con- 
tinuous day-by-day evaluation and the end-of-course testing in specific 
subjects. The second aspect of the program has been considered in Chap- 
ters 10 through 13. 

Emphasis on the first aspect of the program has increased in recent 
years as high schools have assumed greater responsibility for the develop- 
mental and corrective teaching of the basic skills, as teachers have learned 
to use test data on students’ cumulative records, and as school staffs have 
felt the need for having periodic evaluation data, both for appraising their 
own instructional programs and for informing the public concerning their 
effectiveness. The development of machine-scoring techniques and other 
rapid-scoring devices has helped to reduce the cost, and increase the prac- 
ticability, of large-scale testing programs at the high school level. 


A basic program of high school achievement testing should probably 
include: 


1. The administration of a basic-skills battery to all incoming seventh-grade
students (preferably during the last month of the sixth grade so that the
data would be available as an aid in the programming of students into
seventh-grade classes).

2. The administration of a basic-skills battery to all incoming ninth- or
tenth-grade students entering four- or three-year senior high schools,
respectively.

3. Administration of a test battery to evaluate student achievement in the
skills and content learnings of general-education courses.


The following batteries are typical of tests available for use in evaluating 
the outcomes of general-education courses at the secondary school level: 


California Tests of Social and Related Sciences, Advanced (Parts I and II at 
the completion of American history and Part III at the completion of required 
courses in general science and biology)

Cooperative General Achievement Tests, grades 9-13 

Essential High School Content Battery, grades 10-13 

Iowa Tests of Educational Development, grades 9-12

Metropolitan Achievement Tests, Advanced, grades 7-9 

Sequential Tests of Educational Progress, grades 7-12 


In both the elementary and secondary schools, it is important that every
student be tested with a test that is at approximately the right difficulty
level for him so that there are a large number of items on which he can
demonstrate his competency. In the Atlanta schools, students are assigned
to take tests at their reading level.³ In other school systems, teachers’
judgments regarding the level of test that should be administered to each
student are used in assigning students to testing groups. If STEP or SCAT
tests are used, different levels may be administered at the same time in the
same classroom.


Testing of Interests and Personal-Social Adjustment 


If vocational-interest inventories are used as part of a student
self-appraisal program, they should be administered at the same grade level as
the aptitude test battery so that the interest-inventory results may be
interpreted in the light of those from aptitude tests. The hazards of
interpreting vocational-interest tests without accompanying data on students’
abilities have been emphasized in Chapter 7. Because vocational interests
change and mature, and because one interest inventory can serve as a
cross-check on another, many students may desire to take a second
vocational-interest inventory in the eleventh or twelfth grade as part of a
vocational-planning unit or of an elective course for students who need
additional help in life planning.

Tests of personal-social adjustment should be used only in situations
in which the results are likely to be wisely interpreted and when psychological
service is available to help individual students who need such
assistance. In the hands of qualified staff members, personality inventories
can be used to advantage, especially in identifying (with the help of teacher
observation and other techniques) individual students who should be
referred for special study.

³ Warren G. Findley, "Use and Interpretation of Achievement Tests in Relation
to Validity," National Council on Measurement in Education (Ames, Iowa: The
Council, 1961), pp. 23-34.

Youth-problem checklists probably involve less risk of misinterpretation 
and are more meaningful to counselors, teachers, and students than the 
type of inventory that yields a profile of personality components or traits. 
Insufficient research has been done on the construct validity of most per- 
sonality inventories. The subtests on which the profile is based may be so 
unreliable that a change in the student’s answers to only two or three 
questions may cause a marked shift in percentile rank. The danger that 
too much may be “read into” such a personality profile by the naive user
should not be dismissed lightly. 
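
The sensitivity of a short subtest to a few answers can be shown with a small
calculation. The sketch below uses a hypothetical norm group of 100 students
on a hypothetical 10-item subtest (the frequencies are invented for
illustration) and the usual definition of percentile rank: the percentage of
the norm group scoring below a given score plus half the percentage scoring
exactly at it.

```python
def percentile_rank(score: int, frequencies: list) -> float:
    """Percentile rank of a raw score, given norm-group frequencies.

    frequencies[i] is the number of norm-group students with raw score i.
    """
    below = sum(frequencies[:score])  # students scoring lower
    at = frequencies[score]           # students with the same score
    return 100 * (below + 0.5 * at) / sum(frequencies)


# Invented norm-group frequencies for raw scores 0 through 10 (total = 100).
norm_frequencies = [1, 3, 8, 15, 22, 25, 14, 7, 3, 1, 1]

for raw_score in (4, 6):
    rank = percentile_rank(raw_score, norm_frequencies)
    print(f"raw score {raw_score}: percentile rank {rank:.0f}")
```

With these invented figures, a raw score of 4 has a percentile rank of 38 and
a raw score of 6 has a percentile rank of 81, so a change in only two answers
moves the student from below average to well above average on that trait.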


Supplementary Testing 


Up to this point, we have considered tests to be given to all pupils. 
Supplementary testing is essentially of three types: 


1. Diagnostic testing of students tentatively selected for remedial work in the 
basic skills of reading, arithmetic, and language. 

2. The administration of special interest and aptitude tests to students who need 
additional help with problems of educational and vocational guidance (for 
example, a group test of engineering aptitude to seniors considering a career 
in engineering, individual tests of manual dexterity to students considering 
careers as dentists or dental technicians, or the Strong interest inventory 
and a number of aptitude tests to a handicapped student or any other student 
facing special problems of vocational choice).

3. The administration of individual intelligence tests, personality inventories,
and personality tests of the projective type to students referred to the counselor
or psychologist because of marked and persistent underachievement or serious
problems of personal-social adjustment.


For those students whose group-intelligence-test results are markedly
inconsistent, individual intelligence tests are desirable. In addition, all
students with group-test IQ's of 75 or below should be tested individually,
as well as all those whose physical or emotional handicaps would tend to
invalidate group-test results.
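
The criteria in the preceding paragraph can be expressed as a simple screening
rule applied to cumulative records. The sketch below is a hypothetical
illustration: the record layout, the function name, and especially the
15-point spread used to represent "markedly inconsistent" results are
assumptions made for the example; the text itself fixes only the cutoff of 75.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudentRecord:
    """Hypothetical cumulative-record entry for one student."""
    name: str
    group_iqs: List[int]                 # IQs from successive group tests
    invalidating_handicap: bool = False  # physical or emotional handicap noted


def reasons_for_individual_testing(record: StudentRecord,
                                   low_cutoff: int = 75,
                                   max_spread: int = 15) -> List[str]:
    """Return the reasons, if any, for referring a student for individual testing."""
    reasons = []
    if record.group_iqs:
        if min(record.group_iqs) <= low_cutoff:
            reasons.append(f"group-test IQ of {low_cutoff} or below")
        # max_spread is an assumed stand-in for "markedly inconsistent".
        if max(record.group_iqs) - min(record.group_iqs) >= max_spread:
            reasons.append("markedly inconsistent group-test results")
    if record.invalidating_handicap:
        reasons.append("handicap that may invalidate group-test results")
    return reasons


# Invented records for illustration.
records = [
    StudentRecord("Student A", [72, 78]),
    StudentRecord("Student B", [95, 118]),
    StudentRecord("Student C", [104, 101], invalidating_handicap=True),
    StudentRecord("Student D", [110, 108]),
]
for record in records:
    found = reasons_for_individual_testing(record)
    print(record.name, "-", "; ".join(found) if found else "no referral indicated")
```

Such a rule only assembles a list of candidates; the decision to test a
student individually remains a matter of professional judgment.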


Scheduling of Tests during School Year 


The basic or school-wide testing program is usually scheduled for either
the beginning or the end of the school year. One advantage of a consistent
policy in this respect is that a year (or a multiple thereof) intervenes
between successive testings. Hence, data on student growth are more easily 
interpreted from the records. 

Fall, or beginning-of-year, testing has certain clear-cut advantages: (1) 
It permits the teacher to obtain a complete test record for each student. 
When students have been tested the preceding spring, pickup testing is 
necessary for new entrants. (2) The data are up-to-date. During a long
vacation, many students lose in varying degrees their proficiency in certain
skills; on the other hand some students have gained in reading achievement
through their summer reading. Others may have gained in skill subjects
through attendance at summer school or through special tutoring. (3)
Fall testing places the emphasis on the analysis of student needs, rather
than the evaluation of teaching. (4) More time is available for the
administration and scoring of tests and the analysis of results. End-of-year
pressures can result in tests being filed away without being used. (5) Up-to-date
test results can be used as a basis for grouping students for differentiated
work or special corrective instruction. Moreover, scores on survey tests
serve as a starting point for the use of supplementary diagnostic methods
to determine specific retraining needs.

End-of-year testing also has certain advantages. If tests are administered 
during the last month of school, teachers and counselors will have (1) 
recent objective data to aid in problems of promotion, and in programming 
students for the next year and (2) data that can be studied by teachers in
the fall (either before or just after the opening of school). Some school 
districts have a systematic presession program that includes a study by 
the teachers of the cumulative records of incoming students. Obviously, 
testing of achievement in specific high school subjects involves end-of-year 
testing.


ADMINISTERING THE TESTING PROGRAM


Although many teachers and other staff members are involved in the plan- 
ning and carrying out of a testing program, the responsibility for the basic 
program of standardized testing should be centralized. That is, one
well-trained staff member at the central office should be working with one
well-trained person at each school who will see that the many administrative
details involved in giving and scoring tests are adequately observed.
Tests and answer sheets need to be ordered well in advance of testing; test
materials need to be distributed to teachers well in advance of use; testing
schedules need to be planned and materials routed; examiners and proctors
need to be trained in the administration of an unfamiliar test; supplementary
directions regarding scoring may need to be prepared, and a sampling
of work at each stage of the scoring process must be checked. Unless
testing is done under standard conditions and tests are accurately scored, 
the results will be inaccurate and misleading. It is imperative that respon- 
sibility for administering the testing program be centralized in a person 
with training, experience, and a realization of its importance. 

The responsible staff member will probably find it desirable to prepare a 
bulletin covering all pertinent aspects of the school testing program and an 
abbreviated list of procedures for each test with references to appropriate 
sections in the test manuals. Such a list can be affixed to each package of 
test materials checked out to individual teachers. 


Administration of Standardized Tests 


It is essential in the administration of standardized tests that students 
be tested under standard conditions—that is, with the same instructions 
and the same timing as were the students on whom the test was normed.
Unless the testing conditions are the same, the norms cannot be considered
applicable. 


PREPARATION FOR TESTING If valid results are to be achieved, the 
examiner must be thoroughly familiar with the test instructions, must have 
all testing materials at hand, and must have made arrangements so that the 
testing session will not be interrupted. The importance of advance prepara- 
tion cannot be overemphasized. In group testing, there must be no emer- 
gencies. 

The examiner should have so thoroughly familiarized himself with the
instructions that he can read them with clarity and give proper emphasis
to key words and phrases. He must have rehearsed every step of the
process, knowing when and how he must demonstrate a sample exercise,
when students should be asked to read the directions with him, and the
like. If a test is timed, the examiner will need a watch with a second hand
(or preferably a stop watch, the use of which will give him more freedom
for the observation of students). The watch should be checked to see that
it is operating properly.

Before he assembles his testing materials, the examiner should prepare a
list of items needed. This may include scratch paper for certain parts of
the test, extra pencils and erasers, and sometimes special supplies for
machine scoring. Once the materials are assembled, they should be counted
out by rows (that is, the number of tests, answer sheets, and the like equal
to the number of seats in row 1, row 2, and the like).

If machine-scored tests are used, pencils should be checked to make
sure that they are sharpened and that the erasers are in good condition. If
mechanical pencils are used, they should be checked to be sure that each
pencil contains sufficient electrographic lead.



SCHEDULING OF TESTS Before administering a test, the teacher should 
make sure that the students can complete the test before the end of the 
period. Freedom from distractions is very important. He should protect 
the students from interruption by placing a sign on the door reading “Test- 
ing—No Admittance,” and should inform the office that testing is planned 
so that there will be no interruptions through the interoffice telephone. 

Rapport is extremely difficult to establish if a test is given late in the 
day, on the day before vacation, or during a period of undue excitement
(such as before a school athletic event). Hence, these periods should be 
avoided in setting up a test schedule. 

With elementary school pupils, one should avoid having the test extend
into the recess period. An opportunity to visit rest rooms and get a drink
should be provided before the test begins. If only part of a class can be
tested at one time, arrangements should be made for the supervision of
the other pupils outside the classroom. Third-grade classes, as well as some
second-grade classes, can be tested as a group; however, the teacher may
need the assistance of another staff member to serve as proctor, making
sure that pupils are following directions, working on the right page, and
continuing to work throughout the testing period.

In planning testing schedules, adequate time should be allowed for dis- 
tributing papers, giving instructions, answering preliminary questions, and, 
following the test, for collection of tests, answer sheets, pencils, and other 
testing materials. It is also important not to expect students, especially 
younger ones, to work too long in one testing session. Usually the manual 
for a test battery will suggest how the testing time should be divided into 
sessions. It is better to spread testing over several days than to have
children become bored or frustrated.

If students’ answers are recorded on answer sheets rather than test
booklets, the school may route classroom sets of test booklets from class to 
class, rather than purchasing booklets for all examinees. If sharing of sup- 
plies is involved, the testing must not be too closely scheduled. When 
classroom sets are routed from class to class, the testing schedule should 
allow enough time between testing sessions to allow for examining all 
booklets and screening out those on which students have recorded answers. 

A library or cafeteria used for testing large numbers of students should
be well lighted and well ventilated, and should have satisfactory acoustics.
The use of tablet arm chairs is not recommended, especially if separate
answer sheets are to be used.4 Seating arrangements should be planned so
as to provide good working space and minimize cheating. The examiner
may wish to assign to front seats children who are likely to have difficulty
in following directions.


4 Arthur E. Traxler and R. N. Hilkert, “Effect of Type of Desk on Results of
Machine-scored Tests,” School and Society, vol. 56 (September 1942), pp. 277-279.




If the time limit for the test is liberal, the teacher should plan in advance
the best procedure to follow with students who finish early. They should 
be directed to work on some activity that will not require them to move 
about the room and that will not require supervision by the teacher. In- 
structions regarding permissible activities should be given before the test 
begins. If a series of tests is being given to a large group, the last test should 
be a closely timed one so that all students will complete their work at the 
same time. 

Questions after testing has begun should have been discouraged by the 
examiner's initial presentation and should have been made unnecessary by 
the clarity of his instructions and examples. If an individual student is 
perplexed about procedures, a proctor can repeat original instructions; 
but if a student asks for help on a specific item, he must be told, “I’m
sorry that I cannot answer your question. Do your best. If you are stuck,
go on to the next question.”


Communication and Rapport 


One of the most difficult aspects of test administration for the teacher 
is that of “doing enough but not too much” in helping students to under- 
stand the test. The teacher’s role as an impersonal examiner is an un- 
natural one for him. In his eagerness to help students understand the test, 
the teacher may be tempted to reword the directions. Such rewording not
only violates the requirement of standard conditions but also may reduce
the effectiveness of communication to the students. The instructions of a
well-constructed test have been tried out and revised a number of times
before publication.

If a large number of students are being tested, a sufficient number of
proctors should be assigned (in general, 1 for every 20 to 25 students).
Proctors work most effectively when they have carefully studied the test
and the test directions.

As the students are taking the test, the teacher (and proctors) should
move about the room unobtrusively to make sure that each student is
recording his answers in the proper way and that he is working on the
right part of the test. The teacher should avoid moving quickly about the
room, watching a student over his shoulder, carrying on a conversation
with a proctor, or doing anything else that will distract the students. The
teacher should make notations concerning any student behavior that might
affect the test results, such as undue anxiety, distractibility, dawdling, or
needing to leave the room.

Interpretation of results from ability tests rests on the assumption that
each student has done his best. Hence, every student should be motivated
to make his maximum effort. The motivating talk should be brief and to
the point. Usually it is included in the test manual. At least two important




factors affect student motivation in taking tests: (1) the sense of inade- 
quacy that many students experience when they confront items they cannot 
answer, and (2) the sense of indifference that many students exhibit when 
they are confronted with any arduous task that has little meaning for them. 
The test administrator can meet the first problem by explaining that the 
test is designed to measure achievement over a wide range of content and 
difficulty, and will include problems based on material that students have 
not been taught. If the test is closely timed, the examiner can also explain 
that many will not be able to finish their work. The problem of indifference 
can be handled largely through the examiner's attitude of alertness and 
interest in administering the test. An effective means of combating student
indifference is to indicate to the students how the test results will be used 
to help them. 

A comprehensive, well-organized checklist for persons who give
standardized tests has been prepared by Thompson.5 The items in the
checklist are conveniently organized into four sections: (1) before tests are
given, (2) during the testing, (3) after the testing period, and (4) at all
times.

Since standard conditions are so important, a number of schools have
tried having a well-trained examiner dictate test instructions over the inter-
communication system to the rooms involved. Such a plan, however, is so
inflexible that there is no opportunity to adapt to any emergencies that
might interfere with the progress of a single class. The Los Angeles schools
have developed a phonograph record of test instructions, to be used with
the Kuhlmann-Anderson Intelligence Test.6 The use of a record would
provide for somewhat more flexibility.

The use of sound films, filmstrips, and tapes in training teachers to
administer tests, and the certifying of teachers who have taken short
courses on the administration of group achievement tests (or a specific
group intelligence test) represent more desirable steps toward approaching
standard conditions of administration than the use of impersonal methods.
For example, a color filmstrip with a synchronized long-playing record is
concerned with the administration, scoring, and interpretation of the Iowa
Tests of Basic Skills.


Scoring of Standardized Tests


The care taken in the selection and proper administration of tests is to
no avail if the tests are inaccurately scored. Errors are of two types: (1)


5 Anton Thompson, “Test-Giver’s Self Inventory,” Test Service Bulletin No. 85
(New York: Harcourt, Brace & World, Inc., n.d.). Copies available on request.

6 Howard A. Bowman, “Assisting Teachers in Test Administration,” 19th Yearbook,
National Council on Measurement in Education (Ames, Iowa: The Council,
1962), pp. 61-63.




constant errors, resulting in scores that are consistently too high or too low, 
due to failure to understand the scoring instructions; and (2) variable
errors, caused by carelessness in marking, computing, or copying scores. 

Whenever a teacher is scoring a standardized test for the first time, he 
should score three or four tests and ask to have the scoring checked. Such 
an early check-up may avoid the repetition of errors due to a misunder- 
standing of instructions. Even teachers who are familiar with the scoring 
instructions should have a sampling of tests rescored. If careless errors are 
discovered, all tests for that class should be rescored. 

Accuracy and speed of scoring are usually improved if the teacher scores 
test 1 for all students, test 2 for all students, and the like, instead of scoring 
all the tests in each student's booklet before scoring the next student's
booklet. In this way, he handles only one scoring key at a time and can
keep in mind any special instructions for scoring the test. No matter how
familiar the key becomes, the teacher should never score from memory.
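
The test-by-test order of scoring recommended above can be illustrated with a minimal sketch. The keys, student labels, and answers below are invented for illustration, and a real standardized test would be scored by the rules in its own manual (which may include corrections this sketch ignores).

```python
from typing import Dict, List

# Hypothetical scoring keys: one key per subtest, one keyed answer per item.
SCORING_KEYS: Dict[str, List[str]] = {
    "test_1": ["b", "d", "a", "c"],
    "test_2": ["a", "a", "c"],
}


def score_subtest(key: List[str], answers: List[str]) -> int:
    """Count answers that match the key; an omitted item ("") earns nothing."""
    return sum(1 for keyed, given in zip(key, answers) if given == keyed)


def score_by_subtest(
    keys: Dict[str, List[str]],
    booklets: Dict[str, Dict[str, List[str]]],
) -> Dict[str, Dict[str, int]]:
    """Score test 1 for every student, then test 2, and so on."""
    scores: Dict[str, Dict[str, int]] = {student: {} for student in booklets}
    for subtest, key in keys.items():  # only one key is in hand at a time
        for student, answers in booklets.items():
            scores[student][subtest] = score_subtest(key, answers[subtest])
    return scores


# Hypothetical answer records for two students.
booklets = {
    "Student A": {"test_1": ["b", "d", "a", "a"], "test_2": ["a", "b", "c"]},
    "Student B": {"test_1": ["b", "", "a", "c"], "test_2": ["a", "a", "c"]},
}
raw_scores = score_by_subtest(SCORING_KEYS, booklets)
```

The outer loop runs over subtests rather than students, which mirrors the advice that the scorer handle one key at a time; the key is always consulted, never recalled from memory.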

A number of devices have been developed to reduce the burden of test
scoring. An outstanding development has been the invention of the
electrical test-scoring machine.7 This equipment, now available in most city
and many county school systems, requires the use of machine-scored
editions of the tests and of special answer sheets. Schools not having
machine-scoring equipment can hand score the separate answer sheets by means of
scoring stencils with holes in the correct answer positions. Many test
publishers provide machine-scoring service on their own tests on a fee basis.

A number of publishers have given considerable attention to the
designing of tests and keys so as to facilitate hand scoring. For example, equated
scores or grade placements are often printed adjacent to the raw score
equivalents on the test booklets or answer sheets. Test publishers have also
prepared tables and other devices to assist in the computation of
chronological ages from birth dates and intelligence quotients from data on raw
score and age. When a computational aid that increases the speed of test
scoring is used, spot checking is necessary to make sure that it is being
used correctly.

SUMMARY STATEMENT 


The functions to be served by a school-wide evaluation program may be
classified as (1) administrative-supervisory, (2) instructional, and (3) guidance
functions. In order to serve all these functions effectively, an evaluation program
should be comprehensive, continuous, and flexible. It should be developed
through cooperative planning, be based on the objectives of the educational
program, and keyed closely to the local situation. Such a program necessarily
involves a variety of evaluation techniques.

In planning a school evaluation program, the teaching staff needs to make 
an analysis of the major objectives of the educational program and to survey 
possible techniques and instruments which can be selected or developed to 
measure student progress toward these objectives. In selecting standardized 
tests that can be used to advantage, the relevant criteria formulated in Chapter 
5 need to be applied. 

General recommendations were made concerning the planning of different 
aspects of the school testing program: the selection, scheduling and adminis- 
tration of aptitude, achievement, interest and personality tests. Needs for sup- 
plementary testing for individuals and for specific subgroups should also be 
considered. 

For effective administration of school testing programs, responsibility needs 
to be centralized (both at the school and school system levels) in persons who
are well trained for such responsibility. Proper precautions need to be taken
to make sure that persons administering tests are adequately prepared and
adhere to standard instructions; that tests are administered under optimum
physical conditions; that communication and rapport are sufficiently good that
examinees are well motivated; and that the scoring of tests is accurately done. 


SELECTED REFERENCES 


CROOK, FRANCES E., “Elementary School Testing Programs: Problems and Practices,”
Teachers College Record, vol. 61 (November 1959), pp. 76-85.

GORDON, LEONARD V., “Right-Handed Answer Sheets and Left-Handed Testees,”
Educational and Psychological Measurement, vol. 18 (Winter 1958), pp.
783-785.

PHILLIPS, BEEMAN N., AND GARRETT R. WEATHERS, “Analysis of Errors Made
in Scoring Standardized Tests,” Educational and Psychological Measurement,
vol. 18 (Autumn 1958), pp. 563-567.

SARASON, SEYMOUR B., “What Research Says about Test Anxiety in Elementary
School Children,” National Education Association Journal, vol. 48
(November 1959), pp. 26-27.

———, AND OTHERS, “A Test Anxiety Scale for Children,” Child Development,
vol. 29 (March 1958), pp. 105-113.

SMITH, WILLIAM F., AND FREDERICK C. ROCKETT, “Test Performance as a Function
of Anxiety, Instructor and Instructions,” Journal of Educational Research,
vol. 52 (December 1958), pp. 138-141.

SUPER, DONALD E., AND JOHN O. CRITES, Appraising Vocational Fitness. New
York: Harper & Row, Publishers, Inc., 1962, Chapters 4, 5.

THOMPSON, ANTON, “Tentative Guidelines for Proper and Improper Practices
with Standardized Achievement Tests,” California Journal of Educational
Research, vol. 9 (September 1958), pp. 159-166.

———, “Test-Giver's Self-Inventory,” California Journal of Educational
Research, vol. 7 (March 1956), pp. 67-71.




DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. List the types of information you would like to have concerning students 
in a class you are to teach. To what extent could such information be obtained 
from standardized tests? 

2. Outline a basic testing program for either an elementary or high school. 
Indicate the types of tests to be used and the frequency and time of adminis- 
tration. 

3. Report on the uses of tests of scholastic aptitude in a specific school
system. Under what circumstances are individual intelligence tests given to pupils?

4. List several major objectives for a subject area in which you plan to teach.
Which of these objectives could be effectively measured by written achievement
tests?

5. What are the advantages and disadvantages of [illegible] testing programs
in the basic skills? In the [illegible]?

6. Why should the responsibility for test administration be centralized at the
district level and also within the school?

7. In what ways have the electronic methods of test scoring affected the
development of testing programs?

16 Summarizing, Recording, and Reporting Data
about Individual Students


The summarizing and recording of data can be time-consuming processes 
of little significance. On the other hand, they can be carried on in such a 
way that new understandings of the student are achieved: (1) through 
putting together current data on various aspects of behavior and thus per- 
ceiving their interrelationships and (2) through building up a longitudinal 
picture of a student's development. The process of reporting to students
and parents can become routine drudgery or can be a valuable and stimu-
lating process involving a sharing of insights and cooperative planning. In
order that all three processes (summarizing, recording, and reporting) can
become as functional as possible, it is necessary to keep constantly in mind 
that they are means to ends, not ends in themselves. 


SUMMARIZING AND RECORDING DATA 


If measurement data are to be used to best advantage in achieving a better
understanding of students and their problems, an adequate, functional
cumulative-record system is necessary. The term "cumulative-record
system" is used to include all forms and procedures involved in maintain-
ing a continuous record of each student's growth and development.

In a school system that makes adequate provision for recording personnel
data, the cumulative-record system ordinarily consists of (1) a
comprehensive form (known as the cumulative-record form), which includes
identifying data and information on the student's home environment,
scholarship, test scores, attendance, health, and the like; (2) supplementary
card forms containing more detailed information on the student's attendance,
health examinations, and the like; and (3) a cumulative-record
folder1 in which are filed such items as test booklets, anecdotal records,
and reports on student and parent interviews.


Purposes of a Cumulative-Record System 


The cumulative-record system serves many purposes for staff members 
who work with the student. The cumulative record provides the official 
record of a student's attendance, achievement, promotions, and graduation. 
It constitutes the basis for his transcript of record when he transfers from 
school to school. For incoming students, as well as those already enrolled, 
the cumulative records provide data that assist in determining each student's 
appropriate assignment to grade level and to specific classes. 

The study of data recorded in a student's cumulative records can also 
help teachers to understand his behavior—to discover special needs, to 
distinguish between transient and more permanent behavior tendencies,
to find out when a problem started, and to discover clues concerning causal 
factors underlying a student's difficulties. These purposes are achieved not 
only through the compilation of basic facts recorded on the cumulative- 
record card about the student's health, home situation, learning ability, 
and so forth but also through many informal reports filed in cumulative- 
record folders—observations of behavior, reports of student and parent 
interviews, and autobiographies and other self-expressive materials. 

An adequate cumulative-record system also aids teachers and other 
school personnel in helping parents to achieve a more objective and 
accurate picture of the student's achievements, special abilities, and special 
problems. Kawin emphasizes the great potentialities of using the cumulative-
record system to help parents to understand their children; yet she warns
of the hazards of such a process.


Records can be invaluable in home-school contacts if the school personnel 
know how to use its records wisely and constructively in dealing with parents. 
No school device is more effective when wisely used, but no material is more 
likely to antagonize parents if wrongly used. On the whole, no school record 
should just be handed to parents for them to look over. Selected parts of
records should be shown to parents by a member of the school staff who is
competent to interpret this material constructively.2


1 In a large number of school systems, the cumulative-record form is printed on
a folder in which tests, observational records, and the like can be filed. Thus item
1 not only serves to provide a summary of the types of data listed but also fulfills
the function mentioned in item 3.

2 Ethel Kawin, "Records and Reports; Observations, Tests, and Measurements,” 
Early Childhood Education, 46th Yearbook, National Society for the Study of 
Education (Chicago: University of Chicago Press, 1947), p. 290. 




Characteristics of a Cumulative-Record System 


Although the specific items desired may vary from one school district
to another, there are certain general characteristics that are essential in
any good cumulative-record system.


1. A cumulative record (preferably record folder) should be started for each 
student at the time of his entrance to school. 

2. The records should be transferred (at least in summary form) as the student 
progresses from lower to higher schools or moves to another school district. 

3. The cumulative records should present as comprehensive a picture as is 
feasible of the student's growth and development. Provision should be made 
for the cumulation of both test and nontest data. 

4. The forms used should be simple and easily understood. Their maintenance
should require no more clerical work than can be justified by the use of data
recorded.3

5. The cumulative-record system should be flexible, requiring a minimum of
data for all students but permitting great latitude in the types of additional
data that are cumulated for individual students. The use of a record folder
permits such flexibility. For example, records of interviews with parents or
previous teachers can be filed in such a folder.

Camden4 recommends that high school counselors tape-record interviews
with eighth-grade teachers concerning the students who will soon enter the
upper secondary school. Flexibility can also be provided by including in
the cumulative-record form a sufficient amount of blank space for significant
teacher comments and for summary statements. Still another aid to flexi-
bility is the provision of space on the cumulative-record form for rating
special interests and problems as well as for notations on the location of
significant information that is too voluminous or too confidential for entry
on the record form.

6. The cumulative-record system should be designed to reveal trends in growth
over a period of years. This criterion implies the recording of as much
objective data as possible. It implies the use of comparable tests of scholastic


3 The rapid progress being made in machine processing of educational data may
soon result in widespread changes in the posting of data on secondary school
cumulative records. In 1959 the Educational Testing Service, working cooperatively
with school and college educators in Georgia, developed a pilot project in the
maintenance of student records, organizing and summarizing data about students
with the aid of electronic equipment, and reporting such data on comprehensive
student report forms. Since the plan is a flexible one that can be adapted to
various needs, it attracted the attention of educators in other states. A grant from
the Ford Foundation early in 1962 enabled the Educational Testing Service to work
with educators in seven other states in the development of pilot projects similar
to the one in Georgia. The Cooperative Plan for Guidance and Admission now has
a national advisory committee. Five regional conferences were held during the
1962-1963 school year to explore the further development of the plan. Annual
Report, 1961-62 (Princeton, N. J.: Educational Testing Service, 1962), p. 55.

4 Blanche Camden, “For a Better Understanding of Entering Students,” The
School Review, vol. 61 (January 1953), pp. 40-42.




aptitude and achievement as well as comparable measurements of height, 
weight, and other characteristics. Moreover, the record form should be so 
designed that data that are cumulative can be presented in chronological 
sequence. All entries should be dated. Data regarding the student's personal- 
social adjustment (gleaned through observation of behavior, conferences, 
study of creative writing, and the like) should be summarized at the end 
of each school year to reveal evidences of growth, as well as special needs 
and problems. At the time such a summary is made, much of the supporting 
data may be discarded. However, unusually significant materials (such as 
case studies, records of parent conferences, student autobiographies, and 
profiles of recent tests) should be retained in their original form. 


7. Cumulative records should be readily accessible to teachers. However, the
confidential nature of the data must be respected, and the records always
kept in locked files. Some material in a student's record may be so highly
confidential that it should be kept in a special file directly accessible only to
administrative and guidance personnel. In such circumstances the entry
“See confidential file” can direct the teacher to the administrator or guid-
ance worker for an interpretation of the material.


8. Cumulative records should be so maintained that the data are accurate,
complete, and up to date. Data on a student's test results are incomplete
unless the following information is given:
   1. Complete name and identification of test (for example, New Standard
      Achievement Test, Form A, Intermediate Level)
   2. Date on which the test was administered (for example, January 12,
      1964)
   3. By whom the test was administered
   4. Type of score. In all cases in which a standardized type of score has
      been adopted, abbreviations are adequate to identify the type of score.5


9. In the recording of data, every attempt should be made to distinguish facts
from personal opinions. As cumulative records are expanded to include data
obtained through informal evaluation techniques, it is important that
teachers distinguish between objective facts and subjective impressions. The
importance of making this distinction in anecdotal records was emphasized
in Chapter 8.
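
The completeness requirement for recorded test results (characteristic 8 above) lends itself to a simple check. The following is a minimal sketch, assuming a hypothetical record layout; the field names, the abbreviation "GP" for grade placement, and the score value are illustrative only and are not taken from any actual record form.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class ScoreEntry:
    """One test result as posted to a cumulative record (hypothetical layout)."""
    test_name: Optional[str] = None        # complete name, form, and level
    date_given: Optional[date] = None      # date of administration
    administered_by: Optional[str] = None  # person who gave the test
    score_type: Optional[str] = None       # adopted abbreviation for the score
    score: Optional[float] = None


def missing_items(entry: ScoreEntry) -> List[str]:
    """Name every field still blank, in the order the fields are declared."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in (None, "")]


complete = ScoreEntry(
    test_name="New Standard Achievement Test, Form A, Intermediate Level",
    date_given=date(1964, 1, 12),
    administered_by="Miss Smith",
    score_type="GP",  # assumed abbreviation for grade placement
    score=5.4,        # invented value
)
incomplete = ScoreEntry(test_name="New Standard Achievement Test", score=5.4)
```

An entry such as `incomplete` carries a score that cannot be interpreted later, since the date, the examiner, and the type of score are all unknown.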


If the data in cumulative-record folders are to be of optimal value, they
should be (1) organized for ready accessibility, and (2) culled out peri-
odically, the older and less valuable material being discarded after a sum-
mary is made. As a basis for organizing materials in the record folder, the
following plan is suggested:


1. Prepare and mimeograph for school-wide use a simple summary sheet to be
stapled inside each student's folder. As a basis for [illegible] entries in
the record folders, a grid similar to the one in Figure 16.1 is desirable.
Areas to be listed in the left-hand column can be established through group
planning.


5 Adapted from Handbook of Cumulative Records, A Report of the National
Committee on Cumulative Records, United States Office of Education, Bulletin 1944,
No. 5 (Washington, D. C.: Government Printing Office, 1944), p. [illegible].




Area of Information
1. Intelligence
2. Health
3. Personal-Social Adjustment
4. Environment
5. Achievement
5A. Basic Skills
5B. Content Subjects
5C. Work-Study Habits and Skills
6. Special Interests and Talents
7. Social Attitudes

Fig. 16.1 Suggested Summary Sheet for a Cumulative-Record Folder.


2. Number serially all materials filed in a student's folder at the time they are
filed (that is, the first material filed would be entry 1; the next, entry 2;
and the like).6 These numbers should be written in red at the upper-right-
hand corner of each test booklet or other item filed.

3. At the time an entry is filed and numbered, its number should be entered
for the proper grade level opposite the "Area of Information" to which it
pertains. In some cases, more than one such entry may be made. A report
of a home interview, for example, might be entered not only under area 4
but under other areas in which significant information was provided (for
example, area 3 or 6). A report card covering all aspects of a student's
achievement would be listed under area 5, whereas a test limited to the
basic skills would be entered under 5A. An anecdotal record might reveal
significant information in both areas 3 and 7.

4. If the evidence filed reveals a special problem, the number could be
encircled.
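
The four steps of the plan can be sketched as a small index: items are numbered serially as filed, each number is posted to the grid opposite the relevant areas for the proper grade level, and numbers that reveal a special problem are marked. This is only an illustrative sketch; the class and method names are hypothetical, and parentheses stand in for the encircling of a number.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Row labels taken from the suggested summary sheet (Fig. 16.1).
AREAS = {
    "1": "Intelligence",
    "2": "Health",
    "3": "Personal-Social Adjustment",
    "4": "Environment",
    "5": "Achievement",
    "5A": "Basic Skills",
    "5B": "Content Subjects",
    "5C": "Work-Study Habits and Skills",
    "6": "Special Interests and Talents",
    "7": "Social Attitudes",
}


class FolderIndex:
    """Summary sheet for one student's folder (hypothetical structure)."""

    def __init__(self) -> None:
        self.entries: List[str] = []  # entry n is stored at entries[n - 1]
        self.grid: Dict[Tuple[str, int], List[int]] = defaultdict(list)
        self.encircled: Set[int] = set()

    def file(self, description: str, grade: int, areas: List[str],
             special_problem: bool = False) -> int:
        """File an item, assign the next serial number, and post it to the grid."""
        unknown = [a for a in areas if a not in AREAS]
        if unknown:
            raise ValueError(f"unknown area(s): {unknown}")
        self.entries.append(description)
        number = len(self.entries)
        for area in areas:  # one item may be posted under several areas
            self.grid[(area, grade)].append(number)
        if special_problem:
            self.encircled.add(number)
        return number

    def cell(self, area: str, grade: int) -> List[str]:
        """Entry numbers in one grid cell; parentheses mark encircled numbers."""
        return [f"({n})" if n in self.encircled else str(n)
                for n in self.grid.get((area, grade), [])]


index = FolderIndex()
index.file("Report card", grade=4, areas=["5"])
index.file("Home interview report", grade=4, areas=["4", "3"])
index.file("Anecdotal record", grade=4, areas=["3", "7"], special_problem=True)
```

Reading across the row for area 3 at grade 4 would then show entries 2 and (3), which tells the teacher which items in the folder to pull.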


The plan suggested above is only illustrative. The summary sheet used
and the symbols employed should be developed locally and modified in
the light of local experience.

Time should be provided for elementary school teachers, and homeroom
or guidance teachers at the high school level, to make a careful study of


6 If material is temporarily taken out of a student's folder, a blank sheet should
be inserted in its place (entry 14, Miss Smith, date) so that anyone using the
folder would know where the missing material can be found.




the cumulative-record folders of incoming students before school opens.
If such time is not provided, teachers should make such a systematic study
immediately after the opening of school. Such early study will enable the 
teacher to obtain a picture of the average level and the variability of his class 
with respect to scholastic aptitude and various aspects of achievement. It will 
also enable him to identify students in need of corrective instruction or 
other individual attention. On the basis of these data, the teacher will be 
able to plan for needed observational records and for student and parent 
interviews so as to obtain additional data early in the year for those who 
need assistance with special problems. 

Procedures should be developed for disseminating to high school teachers 
data about the aptitudes of all their students and their achievement in 
the basic and work-study skills. Relevant data from standardized tests, as 
well as significant nontest data, could be distributed through the student 
if the data are reproduced on punch cards and not "interpreted" or printed. 
That is, each high school student might receive at registration time a pack 
of six or seven data cards (on which data were punched but not printed). 
The teacher of each class would collect cards from students each period. 
Each teacher could then turn in the cards for all of his students grouped 
by period. The teacher's code number and the appropriate period number 
could be gang-punched on the cards. Then, the names of students,
accompanied by significant test and nontest data, could easily be listed by
machine on roll-book sheets for each teacher's classes.
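
The grouping and listing step described above can be sketched in modern terms. The card fields, teacher code, and student names below are invented for illustration; the sketch only shows cards being tagged with teacher and period, grouped, and listed alphabetically on one sheet per class.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class DataCard(NamedTuple):
    """One student's data card (fields are hypothetical)."""
    student: str
    aptitude_percentile: int
    reading_grade_placement: float


# (teacher code, period number, card), the tags the text says are gang-punched.
Collected = Tuple[str, int, DataCard]


def roll_book_sheets(
    collected: Iterable[Collected],
) -> Dict[Tuple[str, int], List[DataCard]]:
    """Group cards by teacher and period, then list students alphabetically."""
    sheets: Dict[Tuple[str, int], List[DataCard]] = defaultdict(list)
    for teacher_code, period, card in collected:
        sheets[(teacher_code, period)].append(card)
    return {key: sorted(cards) for key, cards in sorted(sheets.items())}


collected = [
    ("T07", 2, DataCard("Lee, Ann", 62, 9.1)),
    ("T07", 1, DataCard("Ruiz, Carl", 48, 8.4)),
    ("T07", 2, DataCard("Adams, Joan", 85, 10.3)),
]
sheets = roll_book_sheets(collected)
```

Each key of `sheets` corresponds to one roll-book sheet, with the test data carried alongside each name.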


REPORTING DATA TO STUDENTS AND PARENTS 


Few issues have occasioned more discussion among parents and teachers 
during the past decade than those concerned with assignment of marks, 
report cards, and the entire process of reporting to parents. Much of the
controversy has probably grown out of differences in the emphasis placed
on various functions of the report card by parents, teachers, guidance
workers, and administrators.


Functions of a Program of Reporting 


The major functions of any reporting plan include the following: 


1. Administrative functions: to provide data for use in promotion, transfer,
and graduation.

2. Guidance functions: to identify areas of special strength and weakness
as a basis for realistic self-appraisal and future educational and vocational
planning.

3. Motivational functions: to stimulate students to increased effort in order to
earn good marks.

4. Informational functions: to inform the student and his parents concerning
his progress toward the goals of the educational program as a basis for
cooperative planning.


Teachers and administrators who place great emphasis on the first two 
functions tend to favor letter or number grades, which are easily recorded 
and have the appearance of being objective and comparable. These edu- 
cators tend also to favor marking a student in terms of his relative status 
in comparison with other students. Unless this is done, they believe, the 
cumulative record of teacher's marks will provide no accurate basis for
knowing what the student's achievement has been or whether he should take
certain high school subjects, such as advanced mathematics and science.

Of the functions listed above, it is the first one that seems to be best
served by the traditional report card. A single letter or number does provide
a convenient, easily recorded symbol of the teacher's judgment concerning
a student's work. Such symbols can be used to compute grade-point-averages
and rank in graduating class. A cumulated record of such marks
can be photographed to provide transcripts for other schools.

Use of students’ marks to serve the second function (that is, to identify
special strengths and weaknesses as a basis for educational and vocational
planning) involves the assumption that grades are comparable. The way in
which grading standards vary from teacher to teacher minimizes the value
of grades for this second function. However, over-all high school grade
average does have predictive value for academic success in college. In
fact, high school grade-point-average tends to correlate as highly with college
grades as do student scores on a long achievement test battery (such as the
Iowa Tests of Educational Development). Moreover, student grades provide
the only reported evidence of students’ strengths and limitations in
many subject areas in which standardized tests are not available, such as
art, music, dramatics, and other fields.

Many parents, and also a number of teachers, stress the motivational
function of marking and reporting. They tend to emphasize the need for
a competitive marking system, which requires that a student be marked in
terms of how his achievement compares with that of his classmates. On
the other hand, many educators stress the disadvantages of competitive
marking, for example, the discouragement of the dull student who does
his best, the stimulus to cheating and to superficial achievement rather
than genuine growth, the effects on parent-child relationships, and the like.

Many educators believe also that dependence should not be placed on
the motivating value of marks—that "a course in which the mark is the
major stimulus for the student to work should be discarded or subjected
to extensive revision."⁷ Immediate information concerning specific successes
or failures, such as can be provided through programmed instruction and
evaluation of projects completed, may provide a sounder basis for motivation.
If the student has many short tests or obtains (through these tests
and through other means) many cues concerning his successes and failures,
periodic summary marks may be unnecessary from the motivational
point of view.

7 William A. Wrinkle, Improving Marking and Reporting Practices (New York:
Holt, Rinehart and Winston, Inc., 1947).

Elsbree points out that the fourth function—that of informing students 
and parents—is the primary function of any reporting system and that all 
other functions are incidental to it.⁸ A single letter or number grade is
especially inadequate for the fourth function, that of providing information
on each student's progress as a basis for cooperative planning with students
and parents. Most of the modifications that have been introduced into
school reporting systems have been designed to improve their value
in communication.


Types of Reporting Practices 


Wrinkle, who has experimented for more than a decade with different
reporting practices at the high school level, states that departures from
the traditional marking system (percentage or letter marks) consist either
of (1) manipulating the symbols (for example, by changing to such
symbols as "S" and "N" for "Satisfactory" and "Needs improvement") or
(2) supplementing the symbols. One purpose of the first approach, which
is widely used in the primary grades, is to reduce emphasis on competitiveness
among students and to reduce parental pressures upon the child.
Such adaptations, however, communicate less information about the child.
If they are supplemented by individual parent-teacher conferences, they
nonetheless provide a satisfactory solution for younger children.

The second approach has involved three techniques: (1) the develop- 
ment of fairly detailed rating scales of major and minor objectives on 
which the teacher rates the student, (2) the use of informal letters, and
(3) the parent-teacher conference. 


THE RATING SCALE Through the use of rating scales, teachers can give
parents a more detailed picture of what the school is trying to accomplish
in each major area and the student's relative strengths and weaknesses
within each area. Since even the simplest rating scale includes a number
of items, the teacher can be expected only to make gross distinctions with
respect to student performance on each item. The rating scale has the
advantage of being a simple way of reporting a great deal of information
with a minimum of time and effort. The data are also easily recorded in
permanent form on the student's cumulative record.

8 Willard S. Elsbree, Pupil Progress in the Elementary School (New York:
Bureau of Publications, Teachers College, Columbia University, 1943), pp. 72-73.
9 Wrinkle, op. cit., p. 50.


The effectiveness of this type of reporting depends upon (1) the care
with which the list of objectives is developed and the objectives are
defined, (2) the extent to which parents and students are involved in
developing the statement of objectives, (3) the techniques for evaluating
progress toward the objectives that the teacher uses in the actual rating
process, and (4) the extent to which students are involved in self-appraisal
of their own progress and in discussion of their own ratings with the
teacher. A rating scale should be limited to those aspects of student
performance on which the teacher can have adequate evidence.

The following excerpts from the Pasadena City Schools Progress Report,
grades 1 and 2, illustrate the way in which a major heading (on which a
grade of "O," "S," or "N" is given) may be followed by a checklist of
behaviors.


Pasadena City Schools Progress Report¹⁰


                                   2d report   3d report   4th report

READING                                S           S+          S+
   Reads with understanding
   Shows interest in reading
   Works out new words
   Reads with fluency

HANDWRITING
   Forms letters correctly
   Is neat

PHYSICAL
   Is developing coordination
   Participates in group games
   Uses equipment properly
   Exhibits good sportsmanship
   Demonstrates skill in rhythms

WORK AND STUDY HABITS
   Makes good use of time
   Follows directions
   Makes satisfactory effort
   Works independently
   Listens attentively
   Does neat work
   Uses materials wisely


10 These grade symbols are interpreted as follows: O—Outstanding, S—at grade
level, N—below grade level. A plus or minus may be used after any subhead
to indicate strength or weakness. Since the first report each year is a teacher-parent
conference, the first entry on the report card is the 2d report. Reprinted with the
permission of the Pasadena, California, City Schools.


THE INFORMAL LETTER The informal letter has many advantages as a
medium for reporting to parents. The letter can be individualized to highlight
the special strengths and needs of an individual student. It can be
highly analytical in those areas of the student's development in which 
specific problems are being met. A carbon copy of the letter constitutes a 
permanent record that should be filed for use by later teachers. The use 
of the letter form stimulates many parents to reply. 

The informal letter, however, has certain limitations. Many teachers 
do not express themselves easily and effectively in writing. Even with 
teachers who do write well, there is greater possibility of misinterpretation 
by the parent than in a conference. It is difficult to report student weak- 
nesses tactfully without minimizing them unduly. Because of these dif- 
ficulties, informal letters frequently deteriorate into stereotyped reports 
of little interest and value. When this deterioration occurs, the values of 
the letter report fail to justify the large amount of time required. 


THE TEACHER-PARENT CONFERENCE There is little disagreement on 
the value of the teacher-parent conference as a technique for communica- 
tion with parents. Like the letter, the conference can be individualized to 
focus attention on those aspects of achievement and those problems that 
are most important to the individual. Through a conference, a variety of 
data and their interrelationships can be interpreted. The possibilities 
of misunderstanding are diminished. The parent has the opportunity to 
present his questions and problems. The teacher obtains information of 
value concerning the student; and, perhaps most important, a good conference
leads to cooperative planning by teachers and parents.

An effective program of teacher-parent conferences requires teacher
education concerning effective preparation and interview techniques, as
well as some released time for conferences and for making adequate written
records for future use. Many school districts operate on a shorter school
day during the two or three weeks in which conferences are scheduled.

Certainly the conference method cannot constitute the sole method of
reporting unless adequate written reports are made. Even if such reports
are made, however, there still remains the problem of marks or grades for
students transferring to other schools, as well as the highly significant
problem of reporting to the student. At present, almost all schools retain
some type of letter-grade reporting at intervals throughout the year. These
intervals vary from six weeks to an entire semester.

In recent years, school districts have been experimenting with the interpretation
of test data to parents through these conferences. In fact, legislation
and court rulings in some states (for example, California and New
York) have required that parents be given information concerning the
results of standardized tests if they so request. Since educators are rightfully
concerned about misinterpretation of such data and the possibility
of harmful pressures on children, considerable attention is now being given
to developing optimum procedures for the interpretation of test results to
parents. In publications concerned with this problem, a number of helpful
suggestions have been made:


1. One should avoid communicating intelligence quotients or other numerical
scores; it is better to use an expectancy chart (or a verbal interpretation of
one) to interpret the score in terms of its relationships to other significant
variables. Intelligence should be interpreted as developed scholastic aptitude,
rather than innate mental ability.

2. If scores are requested, probably stanine scores are the most suitable because
they minimize the risk of the parent's attaching too much significance
to small differences in raw scores.

3. The use of grade equivalents is to be avoided because of (a) their inadequacies
in showing relative strengths and weaknesses (as explained in
Chapter 2) and (b) the faulty inferences likely to be drawn about the
student's competency to do work of a higher grade level.

4. The use of some device, such as the percentile band, which reminds the
reader of the error of measurement, is desirable.


Ebel indicates one of the problems faced when IQ's are reported to
parents.


One principal in a midwestern school ran into trouble with the parents of
a 10-year-old who had an IQ of 110. A neighbor's daughter, who was only
eight, had a measured IQ of better than 130.
"What do you mean," the irate parents demanded. "Our son is smarter than
that eight-year-old. He's a better speller. He can do long division and fractions
and she can't. But she got a higher score than he did. If she's so smart, why
isn't she in the fifth grade, too?"


In the Winchester (Massachusetts) schools, stanine scores for as many
as eight subtests of an achievement battery can be graphed on a form like
that in Figure 16.2. The grid is nine blocks wide to correspond with the
nine segments of the stanine scale. No explanation of stanine is printed
on this chart and no numbers are used. A series of charts are used so that
the teacher can select for each child the one that has the right shading for
"Ability." In the example, the child's stanine for scholastic aptitude is 4;
hence a chart is chosen on which blocks 3, 4, and 5 are shaded. In recording
the data for each subtest, the teacher draws a red line starting at the
left and going through the block representing the stanine equivalent of
the student's raw score on that test. Because of the error of measurement
involved, a student is considered to be achieving reasonably well if his
achievement stanine on each test falls within the shaded area.
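
The chart-reading rule in the paragraph above can be stated as a small check. The sketch below assumes the shaded band is always the ability stanine plus or minus one block, as in the text's example, and that the band is simply cut off at stanines 1 and 9; the function names and the cut-off behavior are illustrative assumptions, not part of the Winchester form.

```python
def shaded_blocks(ability_stanine, half_width=1):
    """Return the stanine blocks shaded on the chart for a given ability stanine."""
    if not 1 <= ability_stanine <= 9:
        raise ValueError("stanines run from 1 to 9")
    # Cutting the band off at the ends of the scale is an assumption.
    low = max(1, ability_stanine - half_width)
    high = min(9, ability_stanine + half_width)
    return list(range(low, high + 1))

def achieving_as_expected(ability_stanine, achievement_stanine):
    """True if the achievement stanine falls inside the shaded (expected) area."""
    return achievement_stanine in shaded_blocks(ability_stanine)

# The text's example: ability stanine 4, so blocks 3, 4, and 5 are shaded.
example_band = shaded_blocks(4)
```

With an ability stanine of 4, an achievement stanine of 5 falls inside the band, while a stanine of 6 or 2 falls outside it and would invite a closer look at that subtest.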


11 Robert L. Ebel, "How to Explain Standardized Test Scores to Your Parents,"
School Management, vol. 5 (March 1961), pp. 61


[Figure: chart titled "SCHOLASTIC ABILITY — ACHIEVEMENT CHART"; the rows under
"AREA TESTED" are Ability, English, Mathematics, Science, and Social Studies,
each plotted on a nine-block scale ending at "HIGH."]

Fig. 16.2 Form Used for Reporting Test Data to Parents in the
Winchester (Massachusetts) Public Schools.
From Norton E. Demsey, Jr., "Reporting Test Results to Parents," The 19th Yearbook,
National Council on Measurement in Education (Ames, Iowa: The Council, 1962), pp.
64-66.


Although the Winchester schools send this report home with the last 
regular report card, the advantages of using such a form at a teacher- 
parent conference in the fall should be seriously considered. On the whole, 
discussion of the chart at the conference and refiling it in the student's 
folder seems advisable, rather than sending it home, possibly to be mis- 
interpreted to the other parent and to the child. 

The following caution regarding interpretation should be communicated 
to parents either orally or in writing when test data are examined in a
teacher-parent conference.


Parents should understand that the testing process is subject to a certain
margin of error, and that the relationships depicted thereon are based on one
test of ability as compared with one test of achievement in each of the areas
listed. Thus, the picture shown cannot be considered precise and it is intended
only as a guide to the child's recent achievement in relation to his potential.¹²


It may be advisable to average all available stanine scores from scholastic
aptitude tests and reading tests so as to obtain a more reliable index of
each student's ability to do academic work.


12 Norton E. Demsey, Jr., "Reporting Test Results to Parents," 19th Yearbook,
National Council on Measurement in Education (Ames, Iowa: The Council, 1962),
p. 65.


In Chapter 8, techniques of interviewing were discussed. Some additional
hints that apply more specifically to conferences regarding the
interpretation of test results are presented by Sax:


1. Try to find out the parent's reactions to the child's progress in school by
such questions as: "How are things progressing?" "How do you feel we
can be of help?" A discussion of these questions provides the teacher with
an opportunity to describe test results as they relate to parent concerns and
involves the parent more actively in the study of the data.
2. Check the cumulative record folder for notations concerning which test
results have already been presented to the parent. Such care will minimize
the risk of needless repetition or of contradicting the picture previously
presented by a colleague.
3. Where the evidence appears contradictory or otherwise inadequate, be willing
to admit the inadequacies of tests and discuss the reasons for variations
in results.
4. Try to gear the type, amount, and complexity of the material presented to
the parent's apparent ability to understand and utilize the results. Stress
those aspects of the data which seem most relevant to "next steps" for the
student in terms of remedial work, or other enrichment plans. Avoid technical
terms, but do not over-simplify or "talk down" to parents.
5. Before closing the interview, it may be advisable to ask the parent to summarize
his interpretation of the test data so that any misconceptions can be
corrected and misunderstandings reduced to a minimum.¹³


Developing Reporting Procedures for Local Use 


Since a satisfactory reporting plan needs to be understood and approved
by the teachers, students, and parents, it should be developed cooperatively,
with all these groups having representation. Any changes in reporting
practice should be preceded by a carefully planned program of interpretation
to students and their parents so that there will be adequate understanding
of the changes and why they are being made. The plan
should be consistent with the philosophy of the school. It should be
comprehensive, including all the major objectives of the educational program.

The forms and procedures used in reporting are merely
the channel of communication. The effectiveness of a reporting plan
depends largely upon what is communicated—that is, upon the quality
of the teacher's appraisal of student growth and the skill with which he
involves students in the evaluative process.

13 Adapted from Gilbert Sax, The Construction and Analysis of Educational and
Psychological Tests: A Laboratory Manual (Madison, Wis.: College Printing and
Typing Company, 1962), pp. 66–67.

The type of reporting procedure used should be chosen in terms of the
relative importance of the different functions of reporting, which varies
considerably from the primary grades through college. Certainly in the
elementary grades, for example, the informational function is crucial. Both
teachers and parents have much information of value to communicate to
each other through such a technique as the teacher-parent conference. At 
the elementary school level, the teacher should certainly not have to place 
much reliance on the motivating function of marks; and their use in the 
administrative and guidance functions is usually one of supplementing 
data from a systematic program of standardized testing in the basic skills. 
At higher levels, reliance on letter grades becomes increasingly appropriate 
as the student goes from junior high through senior high and college. The 
teacher increasingly bases his summary grades on limited samples of 
academic work; hence, he may have insufficient information to justify the 
writing of informal letters or holding conferences with all parents. Also,
the higher the grade level, the greater the importance of the first two functions,
which depend on data that are either in numeric form or can readily
be translated into such form. 

Perhaps no plan of reporting would be quite so effective as periodic 
teacher-parent and teacher-student conferences. However, such a plan 
becomes impractical at the high school level unless the number of teacher- 
student contacts is materially decreased or unless allowance is made for 
such conferences in the work load of homeroom, core-class, or guidance 
teachers. In general, Wrinkle's experience has led him to favor the check- 
list, or rating scale, as a shorthand method of communicating a maximum 
amount of information to high school students and their parents with a 
given amount of time and effort and as a means of providing data that 
can be easily summarized and recorded.¹⁴

One of the essential characteristics of a good reporting plan, especially 
at the secondary school level, is flexibility. That is, the plan should per- 
mit core teachers or guidance teachers to report in some detail on the 
achievement and behavior of students, and yet be so designed that a
teacher who has almost two hundred students (for example, in typewriting 
or physical education) is not obligated to check routinely many aspects 
of student behavior that he has had no opportunity to observe or record. The 
growth report used in the Pasadena junior high schools is designed to allow 
for such variations in practice from teacher to teacher. 

Under this plan, each student receives a subject grade and a citizenship 
grade in each subject. Since both marks are recorded on the cumulative 
record, they are both considered important by students and teachers. 
Students realize that many employers will be just as interested in their 
citizenship record as in their scholarship record. Each of these two grades 
is defined in terms of several subheadings simply worded so that they are
meaningful to students and parents. For example, the subheadings under
the citizenship grade are: (1) Responsibility, (2) Effort, (3) Participation,
(4) Class Conduct, and (5) Courtesy. Those teachers who so desire
may place a "plus" or "minus" symbol following any subheading to indicate
a strength or weakness for that student. Ample space for dated comments
by teachers and parents is provided on the back of the report form.

14 Wrinkle, loc. cit.

Not only does this report allow for variations from teacher to teacher 
but it permits a teacher to report much more fully on certain students 
than on others. Such flexibility encourages the teacher to do as adequate
a job of reporting as time permits without forcing him into a pattern of
routine checking for large numbers of traits. 


IMPROVING THE VALIDITY, RELIABILITY, AND
COMPARABILITY OF TEACHERS' MARKS


It seems that letter marks are almost essential to the administrative functions,
and that they might serve the guidance functions moderately well if
they were more nearly comparable from teacher to teacher. It is important,
therefore, that we direct our attention to principles and procedures that
might increase the validity, reliability, and comparability of teachers' marks.

As with achievement tests, our major concern with teachers' marks is
that they have content validity, that is, that the evidence used in grading be
adequately representative of the objectives and content of the course. We
are also concerned with the reliability of grades, that is, that the sampling
of evidence be large and that subjectivity of judgment be minimized
as much as is feasible in assigning scores to the evidence utilized as a
basis for grading.

When we consider how we can improve the validity, reliability, and
comparability of teachers' final grades, we recognize that three essentially
different problems are involved:


1. How to improve the validity and reliability of the sampling of evidence used
as the basis for grading.
Assistance on this phase of the problem has been given throughout this
textbook as we have considered how to improve the construction and scoring
of teacher-made tests, how to use standardized test results as an aid in
grading, and how to rate products and processes.

2. How to weight and combine the cumulated data for a semester or year into
a single composite score for each student, using procedures that (a) reflect
the desired emphasis on different types of evidence and (b) are objective,
reproducible, and easily explained to students and parents. Marks should not
be used to reward those who have especially pleased the teacher and to
punish those who have not. The procedures used in summarizing
rollbook data should be public information.

3. How to divide the distribution of composite scores for the assignment of
A, B, C, D, and F grades. It is with respect to this step
that we encounter such problems as (a) improving the comparability of
grades for multiple sections of the same course; (b) improving the compara- 
bility of grades from one subject to another; and (c) improving the com- 
parability of grades from school to school. 


Improving the Validity and Reliability of the Sampling of Evidence 
Used in Grading 


If the sampling of evidence cumulated in the teacher's rollbook is to 
constitute a valid basis for marking, the evidence should represent all the 
major goals of the instructional program. Moreover, the evidence basic 
to the subject grade should be limited to information regarding the stu- 
dent's competency with respect to these major goals, uncontaminated by 
extraneous factors. 

Grades recorded on tests or products should be as free as possible of
factors extraneous to student achievement. Tests and homework should be
scored as objectively as possible; grades on student knowledge in different
areas should not be contaminated by such factors as spelling, neatness,
or handwriting. Assignments or tests on which subjective scoring is essential
should be scored by methods that reduce "halo effect." Students
should not receive spuriously high marks for reports that have been
"dressed up"; clear communication, legibility, and neatness should be sufficient.
Nor should students be rewarded for writing reports that are
unusually long. The teacher should suggest an acceptable range in length
and indicate that writing a longer report will not contribute to a higher
grade.

In order to improve reliability in marking, a large sampling of evidence
should be obtained; objective scoring should be used whenever it can be
used without reducing relevance; the raw scores on short quizzes should
be recorded so that evaluative judgments can be based on combined raw
scores; tests should be of optimum difficulty so as to differentiate among
students of different levels of competency.


Weighting and Combining the Data into Composite Scores 


If students are to be fairly graded and to be motivated by knowing 
about their progress to date, the bases for combining data from tests, term 
papers, and other types of evidence should be clearly defined and reported 
to the class. Insofar as possible, the student should be able to reproduce 
this combining process and estimate his grade. It is true that for students 
near the borderline between one grade and another, the student’s score 
on a final examination, and occasionally subjective judgment, must be the
decisive factor. If the latter is the case, the teacher should be able to tell
the student what factors were considered in making that judgment. For
example, for students making identical composite scores at the borderline
between A and B, the teacher might decide to give an A to the student
whose work had improved during the last few months of the year rather
than to one whose work had retrogressed.

The use of stanine scores on essays and other products will force the
teacher to differentiate among levels of competency and will improve the
comparability of marks from one marking occasion to another.¹⁵ Or the teacher
can set up any other system of converted scores, provided that it can be
explained to students and is consistently followed. Numerical scores are
more easily recorded in rollbooks than A, B+, and the like; moreover,
they are more easily totalled at time of summarization. Since students like
to receive A's or B's, however, the teacher can present his own interpretation
of the numerical scores. The author has used the following:


Score   9   8   7   6   5   4   3   2   1
Mark    A       B       C       D       F


The students are instructed to interpret a score of 8 as "between an A and
a B"; a score of 6 as "between a B and a C," and the like. Students can
easily keep a cumulative record of their own grades if they wish.

If all rollbook data are recorded in terms of comparable scores, and marks
on examinations or projects that should receive double weight are recorded
twice, all the teacher needs to do at grading time is to total all rollbook
entries for each student. Such a total gives the composite score for each
student. Or if missing entries due to absences are a problem, one might
check off the highest scores for each student until a median or midscore
is reached.
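
The two summarizing routines in the paragraph above, totalling the entries (with double-weight entries recorded twice) and falling back on the median when absences leave gaps, can be sketched as follows. This is only an illustration; the function names and the sample entries are hypothetical, and the scores are assumed to be already on a comparable scale such as stanines.

```python
from statistics import median

def composite_total(entries):
    """Sum rollbook entries; each entry is (score, weight), weight 2 = recorded twice."""
    return sum(score * weight for score, weight in entries)

def composite_median(entries):
    """Median of the recorded entries, counting double-weight entries twice.

    Useful when absences leave students with different numbers of entries.
    """
    expanded = [score for score, weight in entries for _ in range(weight)]
    return median(expanded)

# Hypothetical stanine entries for one student: three quizzes and one
# examination that is to receive double weight.
entries = [(6, 1), (7, 1), (5, 1), (8, 2)]
total = composite_total(entries)        # 6 + 7 + 5 + 8 + 8
midscore = composite_median(entries)    # median of 5, 6, 7, 8, 8
```

The total is the simpler figure when every student has the same number of entries; the midscore is less affected by a missing quiz.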


If the teacher does not use stanines or some other type of standard
score in the rollbook, he cannot easily weight different types of evidence
as desired. The types of scores that have the highest SD's will receive the
greatest weight in any composite score.¹⁶

15 If the teacher finds it difficult to use the standard percentages in grading all
assignments, he could make sure that approximately one-fourth of the students receive
high stanine scores (9, 8, or 7); that approximately one-half receive the three
middle stanine scores (6, 5, or 4); and that approximately one-fourth receive the
low stanine scores (3, 2, or 1).
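
For a teacher who does wish to convert a set of raw scores to stanines, the "standard percentages" are 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent for stanines 1 through 9. The sketch below assigns stanines by rank order on that basis. It is a simplified illustration: the function name is hypothetical, and tied raw scores are not given special treatment, although in practice tied scores would normally receive the same stanine.

```python
STANINE_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]  # stanines 1 through 9

def to_stanines(raw_scores):
    """Convert raw scores to stanines by rank, using the standard percentages.

    Simplifying assumption: ties are broken by position in the list.
    """
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])  # lowest score first
    # Cumulative upper bounds, as counts of students, for stanines 1..9.
    bounds, running = [], 0
    for pct in STANINE_PERCENTS:
        running += pct
        bounds.append(round(running * n / 100))
    stanines = [0] * n
    stanine = 1
    for rank, i in enumerate(order):
        # Move up to the next stanine once the current one is full.
        while rank >= bounds[stanine - 1]:
            stanine += 1
        stanines[i] = stanine
    return stanines

# Hypothetical class of 100 students with distinct raw scores 0..99
class_stanines = to_stanines(list(range(100)))
high_group = sum(1 for s in class_stanines if s >= 7)
```

With 100 students the three high stanines together take 23 students, the three middle stanines 54, and the three low stanines 23, which is the approximate one-fourth, one-half, one-fourth division mentioned in the footnote.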


DIVIDING THE DISTRIBUTION OF COMPOSITE SCORES FOR THE ASSIGN- 
MENT OF MARKS Once composite scores have been obtained, the teacher 
faces the decision of how generous to be in counting off the number of 
A’s, B's, and other marks. It is at this step in the decision process that high 
school administrators and counselors would like to intervene to improve 
the comparability of grades. Grades would be more meaningful to all who 
use them and would have higher predictive validity if greater comparability 
in grading practices could be achieved. 

In Sweden, comparability on a national scale is achieved by adminis- 
tering a common achievement examination to all students in a subject field 
and then informing each school concerning the number of their students 
who scored well enough on the test to be entitled to the highest mark and 
each of the other marks. The centralized examination does not determine 
which students will receive each mark but how many of each mark are 
available for assignment. In this way, the teachers retain full responsi- 
bility for the assignment of marks; they can take into account additional 
evidence and weight each type of evidence as desired. Centralization of 
decisions concerning the number of A’s, B’s, and the like to be assigned 
ensures that grades have high comparability from school to school. 
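
The distinction drawn above, that the common examination fixes how many of each mark a school may give but not who receives them, can be illustrated with a short sketch. The marks, cutoff scores, student names, and function names below are all hypothetical and are not taken from the Swedish system.

```python
def mark_quota(exam_scores, cutoffs):
    """Count how many of each mark a school may assign.

    `cutoffs` maps each mark to the minimum common-examination score that
    earns it; the values used below are hypothetical.
    """
    ordered = sorted(cutoffs.items(), key=lambda item: item[1], reverse=True)
    quota = {mark: 0 for mark, _ in ordered}
    for score in exam_scores:
        for mark, minimum in ordered:
            if score >= minimum:
                quota[mark] += 1
                break
    return quota

def assign_marks(teacher_ranking, quota, mark_order):
    """Hand out the available marks in the order of the teacher's own ranking."""
    marks = [m for m in mark_order for _ in range(quota.get(m, 0))]
    return dict(zip(teacher_ranking, marks))

# Hypothetical school of four students
exam = {"Ann": 85, "Bo": 62, "Cy": 58, "Di": 90}
quota = mark_quota(exam.values(), {"A": 80, "B": 60, "C": 0})
# The teacher ranks students on all the evidence, not on the examination alone.
final_marks = assign_marks(["Ann", "Bo", "Di", "Cy"], quota, ["A", "B", "C"])
```

Here two A's are available because two students passed the A cutoff, yet the teacher's ranking gives the second A to Bo, whose examination score alone would not have earned it.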

Centralized control of marking, however, may have disadvantages. The 
common test, unless it is very carefully constructed, may not fairly represent
the major goals of instruction; some schools might be allowed to give
more high grades because their curricular emphases approximated closely
those of the common examination. Negative effects on students' motivation
and their self-concepts in schools with larger percentages of slow-learning 
pupils seem inevitable. Since American schools and colleges use standard- 
ized tests as aids in admission and classification, there is less need in the 
United States than in many other countries to ensure comparability of 
grades from school to school. 

In an effort to increase comparability of grades within the school, many 


16 For example, a teacher may tell students that the greatest weight will be given
to homework in determining the final grade. However, if there is relatively little
variation in grades assigned to homework and much greater variation in grades on
periodic examinations, the examinations will automatically contribute more to the
final mark if any objective method of combining data is used. The effective weight
of any component increases with its variability. If stanine scores or T-scores are
used for all variables to be combined, this problem can be ignored, for the variability
of all components in the grading composite will be the same. If comparable scores
are not used, the desired weight for each type of evidence (homework, term papers,
examinations, and the like) must be divided by the SD of scores for that type of
evidence before the data are combined to obtain standard scores.
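
The point of note 16 can be illustrated with a minimal Python sketch. The scores, the 0.7 and 0.3 weights, and the function names are invented for illustration and do not come from the text. The sketch suggests that a raw weighted sum tends to follow the more variable component, and that dividing each desired weight by the component's SD ranks students the same way as combining standard scores.

```python
from statistics import mean, pstdev

# Invented scores for five students; none of these figures come from the text.
homework = [90, 92, 91, 93, 94]   # little spread
exams = [50, 95, 70, 60, 85]      # much more spread
weights = [0.7, 0.3]              # intended: homework counts most


def z_scores(xs):
    """Convert raw scores to standard scores (mean 0, SD 1)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def composite(components, wts):
    """Weighted sum of the components for each student."""
    return [sum(w * c[i] for w, c in zip(wts, components))
            for i in range(len(components[0]))]


def ranking(scores):
    """Student indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Raw scores combined with the nominal weights: the exams dominate.
raw = composite([homework, exams], weights)

# Standard scores combined with the same weights.
std = composite([z_scores(homework), z_scores(exams)], weights)

# The alternative in note 16: divide each desired weight by that component's SD.
adjusted = [w / pstdev(c) for w, c in zip(weights, [homework, exams])]
adj = composite([homework, exams], adjusted)

print(ranking(raw) == ranking(exams))  # True: the raw composite follows the exams
print(ranking(adj) == ranking(std))    # True: same order as with standard scores
```

With these invented figures the raw composite orders students exactly as the examinations do, even though homework was given the larger nominal weight.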


Reporting Data about Individual Students 525 


schools and colleges establish school-wide grading policies. If we wished
to help teachers to modify the school grading policy for various sections
of English or some other multiple-section course, an "anchor test," such
as that described in Chapter 13, might be developed. Teachers might
then be informed concerning how each of their classes performed on
this anchor test.17 The distribution of students’ scores on the anchor test,
however, would not determine the number of A's and other grades avail- 
able but would merely help teachers to modify the school grading policy 
for specific classes on an objective basis. 

If teachers did not wish to use an anchor test, the school grading policy 
might be adapted to specific classes in terms of how well the students 
achieved on some test that was highly correlated with achievement in the 
subject field. In measurement terms, we would select a test that would
have high concurrent validity, that is, high correlation with the criterion
of over-all achievement in the course. Scannell has recommended the
use of such a test as the Cooperative English Test in English classes and
other relevant tests for other departments.

Let us consider how Scannell’s proposal might work. Let us assume that 
all ninth-grade students have taken a local arithmetic test and a scholastic 
aptitude test with verbal and numerical subtests. Let us assume that 
T-scores on the arithmetic test and the numerical section of the scholastic 
aptitude tests have been used by counselors to aid them in programming 
eighth-grade students into basic mathematics (a remedial course), general
mathematics, and algebra. The distributions of students' average T-scores
in these two tests are shown in Table 16.1. Let us assume that the grading
policy for the school is to assign approximately 20 percent A's, 30 percent
B's, 40 percent C's, and 10 percent D's and F's. We have computed P80,
P50, and P10 for the algebra classes, the nonalgebra classes, and all classes
combined since these are the division points that divide each group into
the recommended percentages, that is, highest 20 percent, next highest 30
percent, and the like.

Routine application of this grading policy in all classes is indefensible
because of differences in their achievement. Such questions as the follow-
ing cannot be avoided: Is a specific class representative of the general
school population? If a student is assigned to an honors section, should
he be penalized by the fact that his competency is below average in that
class, although it would be well above average in a typical class in the

17 If the teacher were using such an anchor test as part of his final examination,
he could supplement the anchor test with a number of difficult items of his own
selection for accelerated classes. For a slow-learning group the anchor test could
be supplemented by a number of relatively easy items.

18 Dale P. Scannell, "Making Grades Meaningful: A Proposal," The University
of Kansas Bulletin of Education, vol. 15 (November 1960), pp. 26-35.


526 ADMINISTRATIVE, SUPERVISORY, AND GUIDANCE ASPECTS 


Table 16.1
Frequency Distribution of Ninth-grade Mathematics Students with
Respect to Scores Used as a Partial Basis for Grouping

AVERAGE T-SCALED SCORES ON                          Number of Cases in
ARITHMETIC TEST AND NUMERICAL                        TOTAL                           TOTAL
SECTION OF SCHOLASTIC                BASIC GENERAL    NON- REGULAR  HONORS   TOTAL   GRADE
APTITUDE TEST                        MATH.   MATH. ALGEBRA ALGEBRA ALGEBRA ALGEBRA   CLASS

70 and above                             -       -       -       1       9      10      10
69                                       -       -       -       -       2       2       2
68                                       -       -       -       1       2       3       3
67                                       -       -       -       -       5       5       5
66                                       -       1       1       2       5       7       8
65                                       -       -       -       3       4       7       7
64                                       -       1       1       2       3       5       6
63                                       -       2       2       4       4       8      10
62                                       1       1       2       3       2       5       7
61                                       -       3       3       -       4       4       7
60                                       1       3       4       2       4       6      10
59                                       -       3       3       6       3       9      12
58                                       -       4       4       8       2      10      14
57                                       1       2       3      10       1      11      14
56                                       -       1       1      16       -      16      17
55                                       -       2       2      16       -      16      18
54                                       -       1       1      16       -      16      17
53                                       1       3       4      15       -      15      19
52                                       -       4       4      15       -      15      19
51                                       2       9      11       7       -       7      18
50                                       1      12      13       7       -       7      20
49                                       2      15      17       5       -       5      22
48                                       4      13      17       5       -       5      22
47                                       3      15      18       2       -       2      20
46                                       4      12      16       2       -       2      18
45                                       5      13      18       -       -       -      18
44                                       6      10      16       1       -       1      17
43                                       7      10      17       1       -       1      18
42                                       8       7      15       -       -       -      15
41                                       7       6      13       -       -       -      13
40                                       5       7      12       -       -       -      12
39                                       6       5      11       -       -       -      11
38                                       5       5      10       -       -       -      10
37                                       6       3       9       -       -       -       9
36                                       5       4       9       -       -       -       9
35                                       6       2       8       -       -       -       8
34                                       5       3       8       -       -       -       8
33                                       4       2       6       -       -       -       6
32                                       2       2       4       -       -       -       4
31                                       3       1       4       -       -       -       4
30                                       1       1       2       -       -       -       2
Below 30                                 9       2      11       -       -       -      11

Total number of cases                  110     190     300     150      50     200     500
P80                                      -       -      49       -       -      63      58
P50                                      -       -      44       -       -      56      49
P10                                      -       -      34       -       -      50      36

Ninth grade as reference group
A's (percent at P80 and above)           2       9       -      21      98       -      20
B's (percent between P50 and P80)        6      26       -      71       2       -      30
C's (percent between P10 and P50)       65      58       -       7       -       -      40
D's and F's (percent below P10)         27       7       -       -       -       -      10

Nonalgebra and algebra classes as reference groups
A's (percent at P80 and above)           8      35       -       9      68       -       -
B's (percent between P50 and P80)       20      32       -      30      32       -       -
C's (percent between P10 and P50)       55      27       -      51       -       -       -
D's and F's (percent below P10)         17       4       -      11       -       -       -


same subject? Should larger percentages of A's be assigned in honors
classes and in other classes, such as physics, that attract
able, college-bound students? Certainly, for each teacher to apply the same
grading policy in each of his classes without making allowance for differ-
ences in competency among groups is unjustifiable. The question, in
measurement terms, becomes, What reference group should be used as the
basis for defining the A-level and other levels of achievement?

It would seem that the number of students in a class that had average
T-scores above 58 (P80 for all ninth-graders) on the tests used in grouping
might be the approximate number to receive A's. If this basis were used,
the number of A's could vary considerably from class to class, but there
would still be about 20 percent A's in "all classes combined." Let us
assume that Mr. Brown has one class in basic mathematics, another in
general mathematics, one regular algebra class, and an honors algebra
class. Let us further assume that each of these classes is representative of
all students enrolled in these subjects. Then if we used the entire ninth
grade as our reference group, the percentage of each mark would be
approximately as follows:


Percentage of Students Assigned Each
Mark in Mr. Brown's Classes

                                        BASIC    GENERAL  REGULAR  HONORS
                                        MATHE-   MATHE-   ALGEBRA  ALGEBRA
                                        MATICS   MATICS   CLASS    CLASS
Percent of A's (P80 and above in
  related tests)                           2        9       21       98
Percent of B's (P50-P80)                   6       26       71        2
Percent of C's (P10-P50)                  65       58        7        -
Percent of D's and F's (below P10)        27        7        -        -
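
The percentages above can be checked against the frequencies in Table 16.1 with a short Python sketch. Several things here are assumptions of the sketch rather than statements from the text: the frequencies are as reconstructed from the table (blank cells assigned so that row and column totals agree), 70 stands for "70 and above," 29 stands for "below 30," and the P10 value of 36 is recomputed from the frequencies. The function name is hypothetical.

```python
# Frequencies by average T-score for each course, as read from Table 16.1.
# Assumptions: 70 stands for "70 and above", 29 for "below 30".
BASIC = {62: 1, 60: 1, 57: 1, 53: 1, 51: 2, 50: 1, 49: 2, 48: 4, 47: 3,
         46: 4, 45: 5, 44: 6, 43: 7, 42: 8, 41: 7, 40: 5, 39: 6, 38: 5,
         37: 6, 36: 5, 35: 6, 34: 5, 33: 4, 32: 2, 31: 3, 30: 1, 29: 9}
GENERAL = {66: 1, 64: 1, 63: 2, 62: 1, 61: 3, 60: 3, 59: 3, 58: 4, 57: 2,
           56: 1, 55: 2, 54: 1, 53: 3, 52: 4, 51: 9, 50: 12, 49: 15,
           48: 13, 47: 15, 46: 12, 45: 13, 44: 10, 43: 10, 42: 7, 41: 6,
           40: 7, 39: 5, 38: 5, 37: 3, 36: 4, 35: 2, 34: 3, 33: 2, 32: 2,
           31: 1, 30: 1, 29: 2}
REGULAR = {70: 1, 68: 1, 66: 2, 65: 3, 64: 2, 63: 4, 62: 3, 60: 2, 59: 6,
           58: 8, 57: 10, 56: 16, 55: 16, 54: 16, 53: 15, 52: 15, 51: 7,
           50: 7, 49: 5, 48: 5, 47: 2, 46: 2, 44: 1, 43: 1}
HONORS = {70: 9, 69: 2, 68: 2, 67: 5, 66: 5, 65: 4, 64: 3, 63: 4, 62: 2,
          61: 4, 60: 4, 59: 3, 58: 2, 57: 1}

# Cut scores for the entire ninth grade: P80, P50, P10 (P10 is an assumption).
P80, P50, P10 = 58, 49, 36


def mark_percentages(freq, p80, p50, p10):
    """Percent of a class whose scores fall in each mark's score range."""
    counts = {"A": 0, "B": 0, "C": 0, "D/F": 0}
    for score, n in freq.items():
        if score >= p80:
            counts["A"] += n
        elif score >= p50:
            counts["B"] += n
        elif score >= p10:
            counts["C"] += n
        else:
            counts["D/F"] += n
    total = sum(freq.values())
    return {mark: round(100 * c / total) for mark, c in counts.items()}


for name, freq in [("basic", BASIC), ("general", GENERAL),
                   ("regular", REGULAR), ("honors", HONORS)]:
    print(name, mark_percentages(freq, P80, P50, P10))

# Share of regular algebra students at or above P50 (A or B), before rounding
# each mark separately.
regular_a_or_b = round(100 * sum(n for s, n in REGULAR.items() if s >= P50)
                       / sum(REGULAR.values()))
print(regular_a_or_b)  # 93
```

Passing a different set of cut scores to the same function gives the percentages for another reference group.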


At least two problems arise when a procedure like this is followed: 


1. The algebra students may be graded too liberally in terms of the population
of students with which they will be competing in college and vocations;
for example, in this hypothetical situation, 93 percent of the regular algebra
students would receive "recommended grades," which would suggest to stu-
dents, parents, and college authorities evidence of relative strength in mathe-
matics.

2. Students who spend most of their academic lives in groups that consist of
average and below-average students are almost condemned to receiving low
marks, regardless of the amount of effort they expend or the amount of
growth they make toward the objectives of the course.


In the light of this illustration, it seems undesirable that the reference
group be the general school population. It may be more desirable for the
reference group to be all students enrolled in a subject field. If grades are
reasonably comparable within a subject field they can be used adequately
for the first, second, and fourth purposes (as noted on page 512) without
having negative effects upon the motivational level and self-concepts of
students of average or below-average scholastic aptitude. In other words,
we might use multiple reference groups and rely upon the fact that those
who need to interpret marks, such as admissions officers and counselors, 
know that basic mathematics is usually concerned with remedial work in 
arithmetic and that general mathematics students tend to be a less select 
group than algebra students. 

In other words, differentiated courses could be set up and clearly identi- 
fied by name, for example, the remedial mathematics course would be 
called basic mathematics, while accelerated mathematics courses would be 
labeled on the cumulative record as honors courses. Then, if a school 
defined an A as representing the top 20 percent of its students with respect 
to competency in any subject field, approximately 20 percent of the 
students in basic, regular, or honors mathematics who excelled in their
progress toward the objectives of the course could receive A's. Under such
a plan, however, a conscientious plodder in remedial courses might con-
ceivably become valedictorian; while students would shun honors courses
because of the difficulty of attaining the position of top 20 percent in
such courses.


It seems that the routine application of either of these plans (that is,
using the total school, or enrollees in a specific subject, as the reference
group) leads to undesirable complications. Since teachers vary widely in
their generosity, however, some sensible compromise needs to be reached
so that grades can have greater comparability than they do at the present
time. A reasonable compromise might be to use Scannell's general approach
in modifying the general grade distribution for specific classes except that
the reference group for college-preparatory classes should be the college-
bound students (with whom they will be competing) while the reference
group for classes that are not college preparatory should be the remainder
of the student body. In other words, the number of students in a specific
class scoring above P80 for the nonalgebra classes would determine the
approximate number of A’s in basic mathematics and general mathematics;
while the combined algebra classes would constitute the reference groups
for both regular and honors sections of algebra. If this procedure were
followed, the percentages for Mr. Brown’s classes might be as follows:


Percentage of Students Assigned Each
Mark in Mr. Brown’s Classes

                              BASIC    GENERAL  REGULAR  HONORS
                              MATHE-   MATHE-   ALGEBRA  ALGEBRA
                              MATICS   MATICS   CLASS    CLASS
Percent of A's                   8       35        9       68
Percent of B's                  20       32       30       32
Percent of C's                  55       27       51        -
Percent of D's and F's          17        4       11        -




According to this modification, an honors or regular class of college- 
bound students would be compared with a reference group of college-bound 
students. Students would not shun honors courses because of the probable 
negative effect on their grades. If a specific class in algebra had a higher
proportion of able students than other classes, a larger-than-average per-
centage of A and B grades could be assigned. Teachers would have com-
plete freedom in deciding to which students each mark should be assigned. 
In fact, modification in the standard distribution of grades would be 
recommended rather than imposed. Certainly, the number of students who 
receive failing grades should not be determined by formula; nor should all 
students in honors classes receive A or B grades if their work does not 
justify such marks. An attempt would be made to convince the teachers of 
the advantage of voluntary cooperation in improving the comparability of 
grades. 

The proposals made in this chapter section have been based on the 
premises that (1) single classes cannot logically be graded on any school- 
wide grading policy and (2) grading policies should be modified for specific 
classes on the basis of interclass differences with respect to some test (or 
tests) maximally related to differences in student competency in the sub- 
ject. If the adjustment can be made on the basis of a common achievement 
test or anchor test, we have the advantage of using a variable maximally 
related to achievement in the course. Otherwise, we can make the adjust- 
ment on the basis of some variable that has high concurrent validity as a
substitute for such an anchor test. 

The use of a related test or tests has at least two advantages over the 
use of an anchor test: (1) teachers are not threatened by the possibility 
that class differences in average test score might be interpreted as repre-
senting differences in teaching effectiveness and (2) information concern-
ing the approximate distribution of marks would be available at the
beginning of the year so that grades assigned during the year could be
consistent with final grades. If, in terms of the best data available, one
class seems likely to merit approximately 10 percent A's and another 3
percent A's, differences in grading policies could be initiated early. These
data from related tests, however, should constitute only an aid to the
teacher’s professional judgment; whenever the teacher can devise more
valid bases for modifying school grading policies, he should be encouraged
to use them and to share them with his coworkers. One disadvantage of
using related tests (as a basis for modifications of school grading policy)
is that the plan fails to take into account differences in teaching effective-
ness in facilitating student learning during the semester or year; whereas
such differences would be reflected in a final examination or anchor test. 

Perhaps the best way for an administrator to help teachers to adapt
standard grading policy to their own classes would be as follows: (1)
translate the grading policy regarding percentages of each mark into per-
centile scores that represent these percentage ranges; (2) translate these
percentile scores into equivalent scores on anchor tests or related tests for 
college-bound and other students; (3) prepare a report showing the per- 
centage of cases in each class falling within those limits. These would 
constitute recommended percentages of A's, B's and the like for each class. 
For example, Mr. Brown would receive a sheet for algebra classes in 
which his honors and regular classes would be shown along with all college- 
preparatory classes in mathematics. He would receive another sheet in 
which his classes in basic and general mathematics would be shown in com- 
parison with all noncollege-preparatory classes in mathematics. 
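
The three steps might be sketched in Python as follows. This is a rough illustration under stated assumptions, not a prescribed procedure: the function names and the ten scores in each class are hypothetical, and the cut scores are found by a simple positional rule rather than by the interpolated percentiles used with grouped data.

```python
def percentile_ranks(policy):
    """Step 1: percent of the reference group falling below each mark.

    `policy` lists (mark, percent) pairs from the lowest mark to the highest.
    """
    ranks, below = {}, 0
    for mark, pct in policy:
        ranks[mark] = below
        below += pct
    return ranks


def cut_scores(reference_scores, ranks):
    """Step 2: lowest score earning each mark in the reference group.

    Uses a simple positional rule (an assumption of this sketch).
    """
    ordered = sorted(reference_scores)
    return {mark: ordered[len(ordered) * r // 100] for mark, r in ranks.items()}


def recommended_percentages(class_scores, cuts):
    """Step 3: percent of one class falling within each mark's score range."""
    marks = sorted(cuts, key=cuts.get, reverse=True)  # highest cut first
    counts = dict.fromkeys(marks, 0)
    for s in class_scores:
        for mark in marks:
            if s >= cuts[mark]:
                counts[mark] += 1
                break
    return {m: round(100 * c / len(class_scores)) for m, c in counts.items()}


# School policy: 10 percent D's and F's, 40 percent C's, 30 percent B's, 20 percent A's.
policy = [("D/F", 10), ("C", 40), ("B", 30), ("A", 20)]

# Hypothetical related-test scores for two college-preparatory classes.
honors = [70, 68, 66, 65, 63, 62, 60, 58, 57, 55]
regular = [61, 59, 56, 54, 53, 52, 51, 50, 48, 45]

ranks = percentile_ranks(policy)            # {'D/F': 0, 'C': 10, 'B': 50, 'A': 80}
cuts = cut_scores(honors + regular, ranks)  # reference group: both classes together
print(cuts)
print("honors ", recommended_percentages(honors, cuts))
print("regular", recommended_percentages(regular, cuts))
```

A separate call to `cut_scores` with the scores of the students who are not college bound would give the cut scores for the second report sheet.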

It is recognized that a number of high school courses cannot readily
be classified as college-preparatory or not. All students, for example, are
required to take American history. If classes in such subjects are sectioned, 
the higher-ability sections could be classified as college bound. In subjects 
like typewriting, art, music, home economics, and industrial arts, the 
teacher's subjective judgment about whether a class was a relatively high- 
achieving or low-achieving class might constitute the best basis for modi- 
fying the general grade distribution. Over a period of several semesters, 
however, a teacher's deviations from school grading policy should average 
out so that his cumulated distribution of marks would approximate the
recommended distribution. 


SUMMARY STATEMENT 


A functional cumulative-record system is essential if measurement and evalu-
ation data are to make their maximum contribution to the understanding of
students and their problems. Cumulative records can be used to provide a
longitudinal picture of each student's development, to help teachers in under-
standing each student's special needs and problems, and to assist school per-
sonnel in helping the parent to achieve a more objective picture of his son or
daughter. In addition, they constitute the official record of a student's attend-
ance, scholarship, promotions, and graduation.

A good cumulative-record system should have the following characteristics: 
(1) A cumulative record should be maintained for each student. (2) The
record should be transferred with the student (at least in summary form).
(3) The record should be comprehensive. (4) The forms should be simple
and easily understood. (5) The cumulative-record system should be flexible,
requiring a minimum of data for all students but permitting great latitude in
the types of additional data cumulated. (6) The records should be designed to
reveal trends in growth over a period of years. (7) Although the records should
be readily accessible to teachers, the confidential nature of the data must be
respected. (8) Cumulative records should be so maintained that the data are
accurate, complete, and up-to-date. (9) In recording data, every attempt should
be made to distinguish facts from personal opinions.




In order for the data in cumulative-record folders to be most valuable, they 
should be organized for ready accessibility and culled out periodically. A sum- 
mary sheet is useful for indexing materials in the cumulative-record folder. 

Programs for reporting grades and other evaluation data should be appraised 
in terms of how well they serve administrative, guidance, motivational, and 
informational functions. Perhaps the most important function of any reporting 
plan is to provide the information necessary for sound, cooperative planning 
by students, parents, and teachers. 

New developments in reporting practices include: (1) changing the symbols; 
and (2) supplementing the symbols by (a) the development of fairly detailed 
rating scales for major objectives on which the teacher rates the student, (b) 
informal letters, and (c) parent-teacher conferences. 

In improving the validity, reliability, and comparability of teachers' marks,
three essentially different problems are involved: (1) how to improve the
validity and reliability of the sampling of evidence used as the basis for grading,
(2) how to weight and combine the cumulated data for a semester or year into
a single composite score for each student, and (3) how to divide the distribu- 
tion of composite scores for the assignment of A, B, C, D, and F grades. 


SELECTED REFERENCES 


ALEXANDER, WILLIAM M., "Reporting to Parents—Why? What? How?" National
Education Association Journal, vol. 48 (December 1959), pp. 15-28.

CAGLE, DAN F., AND RAY C. HEISCHMAN, "How May We Make the Evaluation
and Reporting of Student Achievement More Meaningful?", Bulletin of
the National Association of Secondary School Principals, vol. 39 (April
1955), pp. 24-30.

DOBBIN, JOHN E., "What Parents Need to Know About Tests and Testing,"
National Elementary Principal, vol. 34 (September 1954), pp. 152-160.

DUROST, WALTER N., "How to Tell Parents about Standardized Test Results,"
Test Service Notebook, No. 26. New York: Harcourt, Brace & World,
Inc., 1961. Available on request.

HARRIS, F. E., "Three Persistent Educational Problems: Grading, Promoting,
and Reporting to Parents," Understanding the Child, vol. 23 (April 1954),
pp. 34-42.

HORST, PAUL, "How Much Information on Test Results Should Be Given to
Students: Views of a Research Psychologist," Journal of Counseling Psy-
chology, vol. 6 (Fall 1959), pp. 218-222.

KELLY, ELDON C., "A Study of Consistent Discrepancies between Instructor
Grades and Term-End Examination Grades," Journal of Educational
Psychology, vol. 49 (December 1958).

LACEY, OLIVER L., "How Fair Are Your Grades?", American Association of
University Professors Bulletin, vol. 46 (September 1960), pp. 281-283.

LAFRANCHI, EDWARD H., "High School Marks: Comparative or Individual,"
School Executive, vol. 71 (July 1952), pp. 51-54.

PLOGHOFT, MILTON, "The Parent-Teacher Conference as a Report of Pupil
Progress: An Overview," Educational Administration and Supervision,
vol. 44 (March 1958), pp. 101-105.


ROTHNEY, JOHN W. M., Evaluating and Reporting Pupil Progress, What Research 
Says to the Teacher, No. 7. Washington, D.C.: National Educational Asso- 
ciation, 1955. 

SCANNELL, DALE P., “Making Grades Meaningful: A Proposal,” The University 
of Kansas Bulletin of Education, vol. 15, No. 1 (November 1960), pp. 
26-35. 

WAHLQUIST, G. L., “How Machine Processes Save Counselor Time,” California
Journal of Secondary Education, vol. 32 (November 1957), pp. 442-445.

WALTON, WESLEY W., “The Electronic Age Comes to the Schoolhouse,” Systems
for Educators, vol. 8 (January-February 1962), pp. 3-6.

WECKLER, NORA, “Problems in Organizing Parent-Teacher Conferences,” Cali-
fornia Journal of Elementary Education, vol. 24 (November 1955), pp.
117-126.


DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. What information about his students should a teacher obtain from their 
cumulative records at or before the opening of school? 

2. Obtain a cumulative record form designed for use at the secondary-school 
level. Summarize and classify the kinds of information requested. 

3. Evaluate the cumulative-record system of a school district in terms of the
characteristics listed in this chapter.

4. Draft a one-page summary sheet on which a teacher could summarize the
data filed in one pupil's cumulative record folder over a three-year period. 

5. To what extent should students’ test results be reported to parents? What 
guidelines might be developed to minimize the problems involved in such a 
reporting process? 

6. Study the bulletins and other materials developed by a school district to
assist teachers in conducting teacher-parent conferences (for the purpose of
reporting on student growth and planning cooperatively for continued improve-
ment), and note the major principles emphasized.

7. What kinds of data should the teacher have in order
to appraise a student's "total achievement" as a basis for assigning marks?


17
Using Measurement Data in Individual and
Group Guidance


Guidance usually includes such basic functions as (1) helping the student 
to ascertain, understand, accept, and apply the relevant facts about him- 
self in relation to facts about educational and vocational opportunities; 
(2) maximizing his adjustment to his educational opportunities in terms 
of his abilities, interests, and needs; and (3) helping him to reach work-
able solutions to a variety of adjustment problems.3

It is evident from a consideration of these functions that guidance in- 
volves: (1) work with students in every aspect of the school program, 
(2) knowledge concerning the individual student's abilities, interests,
needs, and adjustment problems; (3) the use of a wide variety of formal 
and informal techniques for understanding students and helping them to 
understand themselves; (4) obtaining, recording, organizing, and inter- 
preting many types of test and nontest data; (5) skill in helping students 
to define their problems and to use more effective problem-solving tech- 
niques in working out their own constructive solutions to them; and (6) 
the cooperation of parents and the teamwork of every member of the school 
staff, together with the staff resources of psychological clinics, character- 
building organizations, family social-work agencies, law-enforcement 
groups, and the like. 

A concern for obtaining and using data for guidance purposes has 


1 Donald E. Super and John O. Crites, Appraising Vocational Fitness by Means
of Psychological Tests (New York: Harper & Row, Publishers, Inc., 1962), p 2

2 C. C. Ross and Julian C. Stanley, Measurement in Today's Schools (Engle-
wood Cliffs, N. J.: Prentice-Hall, Inc., 1954), p. 370.

3 John G. Darley and others, “The Functions of Measurement in Counseling,”
Educational Measurement (Washington, D.C.: American Council on Education,
1951), p. 68.




Using Measurement Data in Guidance 535 


permeated this textbook, especially Chapter 1, in which some of the prob- 
lems facing counselors were considered, all the chapters of Part Two on 
"The Study of Individuals," and Chapter 14, on “Educational Diagnosis.” 

No consideration will be given in this chapter to such guidance tech- 
niques as the counseling interview or the dissemination of occupational 
information. There will be an attempt only to provide an overview of 
guidance responsibilities, an analysis of the functions usually performed 
by teachers and by specialized guidance workers, and a discussion of the 
issues and principles involved in the use of evaluation data in individual 
and group guidance, especially as both approaches are concerned with 
problems of educational planning and vocational choice. 


GUIDANCE RESPONSIBILITIES OF COUNSELORS 
AND TEACHERS 


Relationships of Guidance and Education 


The concept of guidance is often interpreted so broadly as to become
almost synonymous with good education. Such interpretations are basically
sound in their emphasis on the teacher as a key guidance worker; in their
claim that many guidance problems can be met more constructively by
curricular change and individualization of instruction than by counseling
students on problems created by the lack of such measures; and in their
recognition that guidance presupposes knowledge of the individual and that
no full-time counselor can know adequately the needs of the 300 to 800
students who may be assigned to him.

The concept of guidance as synonymous with good education is mis-
leading, however, when used as a basis for implying that all guidance can
be done by classroom teachers, ignoring the value of specialized training
for certain guidance responsibilities, or minimizing the need for coordina-
tion of the school guidance program. The increasing complexity of modern
society has made it necessary for students to have more assistance in
making wise choices among diverse opportunities in the occupational world
and among the varied curricular, extracurricular, and work-experience
opportunities in high school and college programs. As one guidance text-
book succinctly states it, the concept of guidance as synonymous with
good education “will probably contribute far more to good teaching than
to improved guidance."4


4 D. Welty Lefever, Archie M. Turrell, and Henry I. Weitzel, Principles and
Techniques of Guidance (New York: The Ronald Press Company, 1950), p. 23.




The Roles of Specially Trained Personnel and of Teachers in the 
Guidance Program 


The best division of responsibilities for guidance must vary with the 
size of the school, the specialized staff, and facilities of the school system 
and the community in which it is located, the attitudes and training of the 
administrative-staff members, and many other factors. Many textbooks in
guidance include a variety of organizational charts for small high schools,
large high schools, and those high schools located in large urban com- 
munities with a wealth of specialized resources available. 

Klausmeier has approached this issue by attempting to distinguish 
between (1) six major services required in an organized guidance pro- 
gram and ordinarily performed by specially trained counselors or guidance 
coordinators, and (2) services ordinarily performed by teachers.


SERVICES USUALLY PERFORMED BY SPECIALLY TRAINED PERSONNEL 


The six major services usually performed by specially trained personnel 
are as follows: 


1. Coordination and leadership in the appraisal program, assisting teachers 
in securing and using evaluation data to understand students as individuals. 

2. Providing leadership in a program of collecting and disseminating accurate 
occupational information to students. 

3. Providing counseling services (including direct counseling to students and 
in-service education of teacher-counselors). 

4. Coordinating group approaches to guidance.

5. Coordinating the referral program and maintaining effective working rela-
tionships with out-of-school agencies.

6. Directing research directly concerned with the guidance program.5 [Italics
added.]

Klausmeier emphasizes that these functions may be performed in the
small schools by the principal and counselor, and in the large schools by
the guidance coordinator, psychologist, psychometrist, and a staff of
counselors.


THE TEACHER AND GUIDANCE SERVICES As teachers increasingly de-
velop a guidance point of view and a well-rounded background for guid-
ance responsibilities, administrators often find it desirable to spread
guidance functions among a number of teachers, thus reducing the number

5 Adapted from Herbert J. Klausmeier, Principles and Practices of Secondary
School Teaching (New York: Harper & Row, Publishers, Inc., 1953), p. 411.




of student contacts for each adviser and facilitating the development of a 
more personal adviser-student relationship. Many teachers—homeroom 
and core-class teachers as well as teachers of occupations and other special 
guidance classes—have been assigned specific responsibilities for group 
approaches to guidance or for individual counseling. Almost all teachers 
have some responsibilities that involve significant potentialities for guid- 
ance (for example, sponsoring an art club; advising the student council, 
newspaper, or annual; or teaching remedial reading). 

The guidance aspects of regular classroom instruction are becoming
increasingly significant. One should not underestimate the importance of 
the total contribution made to guidance by those classroom teachers who 
know their students individually and attempt to meet their personal, as 
well as academic, needs. The guidance aspects of regular classroom 
instruction may be summarized as follows: 


l. Appraising individual students 
There should be a two-way exc 
and teachers: 

а. Teachers 
ord, and 

b. Teachers feeding into counse 
such materials as anecdotal records, 

ical summaries of qualitative judgm 

tudinal picture of the student's growt 

des and interests 


hange of information between counselors 


being supplied with pertinent data from the cumulative rec- 


lors and the cumulative-record system 
reports of interviews, and period- 
ents that aid in building a longi- 
h and his changing needs. 


N 


Helping students to discover their арійи 
As an integral part of a good instructional and cocurricular program, the 
teacher helps each student in discovering his own strengths and weaknesses, 


both with respect to his past achievement and with respect to any special 
aptitudes that may have significance for educational and vocational plan- 


ning. 
3. Practicing informal personal guidance in the classroom 
Informal guidance contacts include discussing with students probable reasons 
for poor scholarship and necessary steps toward improvement, and holding 
many informal interviews with students when they come in with personal 
problems after school, or in club activities that the teacher sponsors, and the 


like. 


4. Creating success experiences and helping students to interpret and build
constructively on necessary failure experiences

a. Through adaptation of school tasks to individual strengths and
weaknesses, and through the provision of prestige-giving experiences for
students
b. Through the use of sociometric techniques to place students in group
situations in which they have improved chances for succeeding socially
c. By helping students to interpret necessary failure experiences as a basis
for future growth and to prevent or minimize future failure experiences.

538 ADMINISTRATIVE, SUPERVISORY, AND GUIDANCE ASPECTS 


5. Disseminating occupational information relevant to the subject taught 
6. Identifying students who need special help and referring such students to 
specialized guidance personnel 


Since teachers are the only staff members who have regular daily contacts 
with students (with opportunities to observe them in peer relationships, 
experiences of success and failure, attitudes toward assigned tasks, and other 
relationships which reveal adjustment problems), they have a special respon- 
sibility for identifying and referring those students who need assistance from 
specialized guidance personnel. 


This analysis of guidance functions is intended to help clarify the
inevitable overlapping of, and close interrelationships between, the
guidance functions of teachers and of specialists, as well as the need for
coordination and leadership.


ISSUES AND PRINCIPLES INVOLVED IN THE USE OF 
MEASUREMENT DATA IN GUIDANCE 


The foregoing review of the guidance responsibilities of school personnel 
has undoubtedly called to mind many techniques of measurement and 
evaluation that can aid students and counselors in their cooperative attack 
upon guidance problems. Knowledge of measurement data on students’ 
cumulative record forms and familiarity with the contents of their record 
folders would appear to be indispensable in many aspects of the guidance 
process. It may seem strange, therefore, that the function of measurement 
in guidance and the role of the counselor in presenting data to high school 
students are controversial issues in guidance. 


Issues Concerning the Advisability of Interpreting Test Data to Students 


Those who question the advisability of interpreting test data to students
tend to base their position on several premises, each of which will be
considered.

1. That the influence of unmeasured factors (notably the student's
"drive" or motivation) is so great that the measurable factors, by
comparison, are relatively insignificant. As tests and aids to their
interpretation have improved and as counselors have learned to consider
significant nontest data in combination with test results, this argument
is less frequently offered. The problem is essentially one of deciding
which aspects of behavior can be measured by tests, selecting the best
tests for the purpose, and appraising, as objectively as possible, the
other factors.

The significant factor of "drive" or motivation must be kept constantly


Using Measurement Data in Guidance 539 


in mind. Although the counselor has no score on the student's motivation, 
he does have useful data—reflected in student marks, teacher comments, 
participation in curricular and work-experience activities, and the like. 

Let us imagine, for example, that a student who is seeking the coun- 
selor’s advice on college attendance has shown good motivation and effort 
throughout his school years (as reflected in teachers’ comments in his 
cumulative record folder and a consistent record of B and C grades). His 
IQ's on three group tests range from 88 to 100. Even with unusually
good effort, this student has done only average work in high school. Such 
data probably constitute an adequate basis for the counselor’s encouraging 
this student to consider several vocations that require no college prepara- 
tion. Interest inventories, aptitude-test results, records of marks and 
activities, and the student’s own expressed interests in the interview situa- 
tion would provide leads which the counselor could use to approach the 
problem positively, rather than negatively.

Let us imagine another student, with a consistent record of IQ's above
120 but a spotty record of grades, showing relatively poor motivation. If 
this student has good marks in his field of special interest, and appears 
now to be highly motivated by his assignment as a laboratory assistant in 
Science, the counselor probably has sufficient data for encouraging him 
to attempt college work in science, especially if the economic status of his 
family will permit him to devote full time to his studies.

In both these cases, data on untested factors of motivation and economic 
status should be considered along with test data.

2. That existing tests have serious limitations. This criticism is a valid
and exceedingly important one. Persons who reject all test data on this
basis, however, are probably reacting strongly against the tendency of
students, and even of some counselors, to place too great faith in test data,
that is, to feel that they will automatically provide the answers to
problems. Test data must always be used as an aid to professional judgment,
not as a substitute for it.

Super and Crites emphasize the limitations of available tests and the
necessity for interpreting test data cautiously, in the light of all other
available information.


The use of tests by a vocational counselor is . . . of necessity generally not a
predictive process but rather a clinical procedure. A variety of data have to be
studied in relation to each other, and hypotheses are established. . . . It should
be noted that the term hypotheses is used, rather than conclusions, as their
bases are not definite enough to warrant the term conclusion. [Italics added.]


5 Super and Crites, op. cit., p. 533.




Those who “view with alarm" the limitations of test data often fail to 
consider the more serious limitations of working on the basis of subjective 
judgments that may have little or no basis in fact. Four valuable properties 
of test data, as compared with more subjective judgments, can be cited: 


a. The property of reliability, accuracy, and objectivity (which serves as a
necessary antidote to tendencies toward over- or underestimation of
qualifications by either counselor or student).


b. The property of validity or meaning or predictive power. Good psychological 
tests, despite their limitations, have predictive value for vocational and edu- 
cational success. This predictive value can serve as a balance wheel in the 
wishful thinking with which students approach many decisions. 


c. The property of economy of effort. As short, standardized samples of
behavior, tests can often supply in a short time and at a relatively low cost a
basis of judgment-making that is a practical substitute for trial-and-error
decisions.


d. The final property is their normative aspect, or their indication of an
individual's standing relative to others of similar age, background, or
experience.⁷


3. That the use of test data in guidance tends to make the student a
passive, dependent receiver of information, rather than an active solver
of problems. Test data can be presented in such a way as to create a
passive mind-set on the part of the student. If the student expects the test
data to provide specific answers to his problems, the counselor must stress
the limitations of tests and the way in which test and nontest data should
be checked against each other in making judgments. If test results are
used appropriately in the counseling process, that is, merely to provide
needed information and to question or to verify hypotheses, student
passivity and dependence will be avoided. The student should be encouraged
to interpret his own results, checking his interpretations with the
counselor's judgment, and evolve his own hypotheses concerning their
implications for his plans.

7 Adapted from Darley and others, op. cit., p. 77.

4. That interpretation of test results by the counselor may threaten the
student's concept of self and hence disturb the counseling relationship.
This criticism represents a very real hazard and emphasizes the need for
great skill and caution on the part of the counselor. Growth in student
self-understanding, however, is a very important goal of the guidance
program and one that should be attacked skillfully, rather than evaded.
In a significant study, test data were interpreted to high school students
in individual conferences by highly qualified counselors under good
conditions of counselor-student rapport. Rothney reported that 60 percent of
the student reactions to the test data, presented under these favorable
conditions, were definitely favorable ("expectations or current plans
confirmed," "results higher than expected," "seemed pleased," "showed high
interest"); 9 percent were clearly negative ("disappointed," "skeptical").
The remaining 31 percent of the reactions, although they might be
classifiable as neutral, implied indirect rejection. The process of gaining
insight is emotional rather than rational. The student's reactions to
objective evidence need to be sounded out and examined in a manner
characteristic of the nondirective interview.


In interpreting test results and in presenting or summarizing other
information, the counselor employs an objective, impersonal approach, and seeks to
increase the client's self-acceptance. . . . [When information is presented] in an
impersonal, objective manner, the individual must unconsciously go through a
process of giving it personal meaning, and it is this "process of giving meaning"
which the counselor may use to help his client know himself better. Reported
case-histories show that when this impersonal presentation has been used,
significant self-analysis occurs.

No matter how perfectly standardized a test may be, the test results will be
of no more use than the client can allow them to be. Assimilating and making
use of test information is a problem of feelings and attitudes. . . . Letting him
proceed at his own pace, stating his objections or approval, examining why he
objects or approves, expressing his feelings freely, will prove more valuable in
the long run.


The student may either be able to accept the test predictions and use 
them in his thinking or he may need to distort them in some degree. The 
more the student feels free to discuss his reactions to test results with his 
counselor, the more likely it is that he will be able to comprehend them 


and accept their probable significance for him. 


Principles Involved in Interpreting Test Data to Students 


Much of the distrust concerning the use of test data in guidance has
developed from questionable practices in this area. As a step toward
approaching the problem more positively, the authors have formulated a set of
principles for the use and interpretation of measurement data in guidance.


8 John A. M. Rothney, "Interpreting Test Scores to Counselees," Occupations,
Vol. 30 (February . . .), pp. 320-322.

9 ". . . Interviewing Techniques in Vocational Counseling," Journal . . . ,
Vol. 11 (March-April 1947), pp. 70-73.

10 ". . . Vocational Counseling Methods," Educational and Psychological
Measurement, Vol. 9 (Summer 1949), pp. 186-188.




1. The best available tests for the purpose should be used. Selection should be 
based on the major criteria of validity, reliability, and adequacy of norms, 
rather than on such minor criteria as cost or ease of scoring. In the selection 
of aptitude tests, the availability of suitable norms and meaningful validation 
studies is especially important. If achievement tests are used as predictors, 
they should be appraised in terms of their value for this purpose. 

2. Responsibility for interpretation of test data, and the in-service education 
of teachers in test interpretation, should be placed in the hands of specially 
trained guidance workers. 

3. Test data should be considered in the context of all other available informa- 
tion. Although one can interpret extreme test scores with considerable con- 
fidence, if they are consistent with other data, one must actively seek new 
information for students for whom predictor test scores seem to indicate 
about a 50-50 chance of gaining a desired objective. 

4. Test data should be interpreted in terms of probabilities, rather than cer- 
tainties. Instead of saying, "Jim's aptitude-test score is so low that he will 
fail in algebra," one should say, "Seven out of ten students with aptitude- 
test scores as low as Jim's have made D or F grades in algebra." Expectancy 
tables (such as Table 17.1 and Figure 17.1) involving one or more pre- 
dictor variables, can be very helpful in this respect. 

5. Counselors should present test data to students in such a way that (a) the
data are brought into the counseling interview only as they help in meeting
a need or attacking a problem formulated by the student, (b) they are
presented objectively and impersonally by the counselor with the student


Table 17.1
Expectancy Chart (Based on Grades Received by 500 Students Taking
Arithmetic Test at End of Eighth Grade)

                         Chances in 100 of Making an
RAW SCORES ON                 Algebra Grade of
ARITHMETIC TEST        D OR F      C       B       A

90-99                              2      23      75
80-89                     5       15      40      40
70-79                    10       30      45       3
60-69                    30       40       …       …
Below 60                 80       20

Note: The grades received by students at each raw-score level (for example,
90-99) are tallied; the frequencies for each row are then translated into
percentages so that the predictive validity of the test can be interpreted
in terms of chances in 100 of making each grade. Note that since grades
represent only an ordinal scale (see Chapter 2, p. 63), a correlation
coefficient could not be computed by the Pearson product-moment method.
Ordinarily, the contingency coefficient would be computed if a validity
coefficient were desired. See Quinn McNemar, Psychological Statistics, third
ed. (New York: John Wiley and Sons, Inc., 1962), pp. 198-202.




interpreting their personal meaning for him, and (c) the student is encour- 
aged to take an active role, expressing his reactions to the test results, re- 
lating them to relevant information about his in-school and out-of-school 
experiences, and formulating hypotheses about their implications for his 
choices. 

6. Guidance workers should use special caution in interpreting data from all 
tests on which the examinee can falsify or distort his responses at will— 
for example, tests of interests, personal adjustment, and the like. 

7. The counselor's approach in an interview involving test interpretation should
be conditioned by his realization that the student's interpretation of test
data is sometimes more of an emotional than a rational process. It should be
recognized that modifying a student's self-concept so that it becomes more
realistic is a gradual process. Hence, his realization of the implications of
test data should be the outcome of several counselor-student contacts, rather
than a single dramatic event. Between these successive contacts, the student
can be engaged in reading and other exploratory activities in vocational
guidance and other relevant areas.


GUIDANCE IN EDUCATIONAL AND 
VOCATIONAL PLANNING 


Special attention will now be given to the use of measurement data in 
connection with Problems 6 and 7 of Chapter 1, that is, helping students 
with their decisions in educational and vocational planning. A number of
the procedures presented are also usable in employee classification. 


Counseling as a Cooperative Problem-Solving Process 


A central aspect of guidance, and one in which measurement can be of
great assistance, is that of counseling students concerning "next steps" in
educational and vocational planning. A conference between a counselor
and counselee on next steps in life planning should be a cooperative
problem-solving process. The counselor can bring much to the conference
in terms of his maturity, his experience in aiding in the problem-solving
process, his resources (in terms of data about the student and about the
environmental factors that should be considered), as well as his special
skills in the interpretation of such data. The student, however, has data
about himself (his aspirations and needs that cannot be measured and
filed); moreover, he has the basic responsibility for making decisions
about "next steps" in his life planning. It is he who will
enthusiastically or half-heartedly test out the hypotheses about "next
steps," or perhaps reject them altogether. He is the one who must live with
the consequences of his decisions.


[Figure 17.1: two pairs of graphs, one pair for students scoring in
stanines 8 and 9 (IQ's 120 and above)* and one pair for students scoring in
stanines 6 and 7 (IQ's 104-119). In each pair, Graph A is for high school
GPA below 2.5 in college preparatory courses and Graph B is for high school
GPA 2.5 and above in college preparatory courses. Each graph shows freshman
GPA at State University in four ranges: below 1.0, 1.0 to 1.5, 1.6 to 2.1,
and 2.2 and above.]

Fig. 17.1 Graphs (based on local expectancy tables) for Use in
Interpreting Data on Both Scholastic Aptitude and High School
Grades, as a Basis for Predicting Achievement at the State University.

NOTE TO STUDENTS: This graph shows how former students of this school, whose
average scores on two scholastic aptitude tests were similar to yours,
achieved during their freshman year at the state university. Use your own
grade-point average in college-preparatory subjects to decide whether you
should use Graph A or Graph B. If your grade-point average is approximately
2.5, look up your expectancies in each graph, and average them.

These graphs are based on the same general approach as those in George E. McCabe,
"Test Interpretation in the High School Guidance Program," Test Service Bulletin No. 93
(New York: Harcourt, Brace & World, Inc., n. d.). Single copies are free on request to
the Division of Test Research.

* These headings would not appear on the student's form; instead a code number
should be placed in one corner of the form to facilitate the counselor's using the
right chart with each student.


Testing the Suitability of a Decision Already Made by an Individual 

Some counselors are delegated responsibility for judging the suitability 
of specific educational and vocational plans; for example, a counselor 
employed by the Veterans’ Administration is required to pass on the 
appropriateness of vocational plans for trainees. Many other counselors 
are faced with similar problems, in that a counselee seeks their aid in
assessing the feasibility of a specific choice of college or vocation. That is,
the counselor is asked to predict what a college admissions officer or an
employer might decide. In such situations, the counselor who has at hand
expectancy tables based on an adequate number of cases and relevant to
the specific situation can interpret the student's test data, in combination
with his grades, to help him formulate hypotheses about the probable
results of a specific choice. Table 17.1 illustrates a useful expectancy
table based on a single predictor variable. Figure 17.1 introduces a second
nontest variable and yet keeps the presentation easy to understand. The
use of prediction equations or profile similarity studies, discussed in a later
section, might prove even more useful in considering such questions.
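
The tallying procedure behind a single-predictor expectancy table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the score bands, and the handful of student records are hypothetical and are not the 500 cases on which Table 17.1 is based.

```python
from collections import Counter, defaultdict

GRADES = ["D or F", "C", "B", "A"]

def score_band(raw_score):
    """Map a raw score (assumed 0-99) to a row label like those in Table 17.1."""
    if raw_score < 60:
        return "Below 60"
    low = (raw_score // 10) * 10
    return f"{low}-{low + 9}"

def expectancy_table(records):
    """Tally (raw_score, grade) pairs by score band, then convert each
    row of frequencies into chances in 100."""
    tallies = defaultdict(Counter)
    for raw_score, grade in records:
        tallies[score_band(raw_score)][grade] += 1
    table = {}
    for band, counts in tallies.items():
        total = sum(counts.values())
        table[band] = {g: round(100 * counts[g] / total) for g in GRADES}
    return table

# Hypothetical records for former students (illustration only).
records = [
    (95, "A"), (92, "A"), (91, "B"), (97, "A"),
    (55, "D or F"), (48, "D or F"), (58, "C"), (52, "D or F"), (41, "D or F"),
]
table = expectancy_table(records)
print(table["90-99"])
print(table["Below 60"])
```

With so few cases per row the percentages would be far too unstable for counseling use; the text's requirement of "an adequate number of cases" applies to any table built this way.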


Choosing the Optimum Level of Work 


A decision concerning how heavy a load of college-preparatory subjects
should be taken (as in Problem 6 of Chapter 1), or how selective a college
to attend, involves estimating the student's level of general academic
aptitude and past achievement, as well as his level of motivation and
self-discipline as a student. Here again an expectancy table can be helpful.
A decision concerning a light or heavy load of college-preparatory work,
however, should also be guided by data on the student's study habits, level
of aspiration, and the like.

Expectancy tables, which help a student to see that achieving a B grade




in algebra appears possible but not probable for him, might help him to
postpone his college preparatory work in foreign language and concentrate
his study time on a subject that appears to be fairly difficult for him.
Then, when his year's experience in algebra has provided additional data, 
he would be in a better position to make further plans. 

It is often helpful to both students and parents to avoid overemphasis 
on the ninth grade as a crucial choice point with respect to college- 
preparatory vs noncollege-preparatory courses. Although present decisions 
should be made in terms of the perspective of a long-range plan, a maxi- 
mum of flexibility for choosing alternatives at later choice points should 
be retained whenever possible. Educational and vocational planning in- 
volves a series of decisions; the road chosen at a particular choice point 
may have many branches. 


Choosing among Alternatives on the Basis of Intraindividual Differences 


When a student is attempting to choose among several vocations, and 
he has an adequate level of general ability for each of them, differential 
prediction is involved. On the basis of his intraindividual differences, we
want to predict whether he will be more successful and happy in one 
vocation than in another. As was emphasized in the discussion of Problem 
7 in Chapter 1, tests that predict a student's level of achievement in
academic work may be of little value in predicting whether he would do better
in nursing or teaching, or whether he would find greater job satisfaction in 
one vocation than the other. 

In helping students make hypotheses involving differential prediction, 
the counselor tends to use a clinical approach to the study of test and 
nontest data. Expectancy charts can help him in checking the suitability 
of a student’s tentative decision or in estimating his probability of success 
in colleges or curricula demanding different levels of academic aptitude. 
Published research studies to assist in making differential predictions with 
respect to the student's relative chances of success in various fields are 
few and inadequate. 

The counselor can determine whether the student's interests are more
nearly similar to those of persons successfully engaged in one occupation
or the other. He may be able to report the probability of his gaining
entrance into, and completing, the training programs for two or more
different fields. But as for predicting the difference in his ultimate
"success" in two or more vocations, research data provide no adequate basis
for prediction. Fortunately, there is quite a variety of jobs within many
vocational areas so that persons often find positions that are satisfying to
them even though the general vocational field was a far-from-ideal choice.




DIFFERENTIAL PREDICTION BY MEANS OF PREDICTION EQUATIONS Some 
schools and colleges have obtained local data that help in differential pre- 
diction; for example, they may have developed prediction equations by 
which they can predict a student’s grade-point average in each of several 
curricula on the basis of their entrance-test data. When these equations 
involve two or more predictor tests, they are called multiple regression 
equations. These are similar in type to the prediction or regression equa- 
tions discussed in Chapter 3 but involve an optimally weighted combination 
of a student's scores on two, three, or more predictor tests.¹¹

In order to obtain a better understanding of multiple-regression equa- 
tions, we will use a simplified illustration. Let us assume that we have 
available students’ scores on several tests, which might be useful in pre- 
dicting whether or not they are likely to succeed in ninth-grade algebra. 
In making our predictions, we would like to take into account both the 
most recent intelligence test (variable 1) and one other test that would
contribute most to accuracy of prediction.

We will call the predicted variable (grade in algebra) variable y, and the
predictor tests will be numbered 1, 2, 3, 4, and 5. Let us assume that each
of the five predictor tests correlates .50 with grade in algebra.

Variable 1 (8th grade IQ test)
Variable 2 (arithmetic computation test)
Variable 3 (arithmetic reasoning test)
Variable 4 (reading test)
Variable 5 (6th grade IQ test)

Although this example has been artificially simplified, these are all
variables that might well show an r of about this size with grade in algebra.
These tests differ, however, in the extent to which they provide new
information. The correlations of each of the four tests (2 through 5) with
Variable 1 (the most recent intelligence test) differ as follows:

$r_{12} = .30$ (IQ and arithmetic computation)
$r_{13} = .50$ (IQ and arithmetic reasoning)
$r_{14} = .70$ (IQ and reading)
$r_{15} = .80$ (IQ and previous IQ test)

11 Ordinarily, a test of scholastic aptitude is the first predictor test included. A
second, and sometimes a third, predictor test are selected on the basis of relatively
high rs with the criterion (for example, grade-point average in a specific major
field) and relatively low rs with the other predictor tests. In interpreting predicted
grade-point averages in different major fields, the counselor must make allowance
for departmental differences in grading standards through the use of some type of
converted score.




We can now compute R, the multiple correlation¹² between y and each
pair of variables. In each pair, we combine the most recent intelligence test
with each of the other predictor tests.

$R_{y \cdot 12} = .62$ (algebra grade with IQ and arithmetic computation)
$R_{y \cdot 13} = .58$ (algebra grade with IQ and arithmetic reasoning)
$R_{y \cdot 14} = .54$ (algebra grade with IQ and reading)
$R_{y \cdot 15} = .53$ (algebra grade with both recent and 6th grade IQ tests)


According to these results, if we can use only two tests in prediction, the
best test to select from this group would be test 2 on arithmetic
computation. The use of the arithmetic computation test increases the
correlation with algebra grade from .50 to .62 because this test has the
lowest r with test 1 and the smallest percentage of overlapping variance.

This test provides more new information than any other single test.
Whether or not a counselor uses prediction equations, he needs to realize
that both he and his counselee can keep only a limited number of factors
in mind at a time. Hence, in his selection of tests to be administered or in
his interpretation of test data relevant to a decision, he should choose tests
that are highly related to success on the criterion and that have a relatively
low relationship with each other. If we were going to prepare a
two-variable expectancy chart for students, we would choose the arithmetic
computation test as a second variable in preference to other tests that
overlap to a greater degree with the most recently administered intelligence
test.
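
The multiple correlations in this illustration can be checked with the standard three-variable formula, $R^2 = (r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}) / (1 - r_{12}^2)$. The Python sketch below is offered only as a check on the arithmetic; the function name is our own, and the IQ-reading correlation of .70 is an assumed value chosen because it is consistent with the reported R of .54.

```python
import math

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of y with two predictors (three-variable case)."""
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)

# Every test correlates .50 with algebra grade; only the overlap with the
# most recent intelligence test (variable 1) differs.
overlap_with_iq = {
    "arithmetic computation": 0.30,
    "arithmetic reasoning": 0.50,
    "reading": 0.70,  # assumed value, consistent with the reported R of .54
    "6th grade IQ test": 0.80,
}

results = {name: multiple_r(0.50, 0.50, r_12)
           for name, r_12 in overlap_with_iq.items()}
for name, value in results.items():
    print(f"IQ + {name}: R = {value:.2f}")
```

The lower the correlation between the two predictors, the larger the gain over the single-test correlation of .50, which is the point the illustration is making.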

We discussed in Chapter 3 the simple regression equation in
standard-score form

$$z_y = r z_x$$

where the coefficient of $z_x$ was r, the slope of the prediction line. When
three variables are involved in a correlation, the equation for the
prediction line in three-dimensional space is a little more complex

$$z_y = \beta_1 z_1 + \beta_2 z_2$$

where $z_1$ and $z_2$ represent the standard scores on the first and second
predictor tests (in this case intelligence and arithmetic computation) and $\beta_1$


12 The procedures for computing R, the multiple correlation between a predicted
(dependent) variable and one or more predictor (independent) variables are given
in standard textbooks on statistics, for example, Quinn McNemar, Psychological
Statistics (New York: John Wiley and Sons, Inc., 1962), pp. 174-187. If a person
wants to obtain approximate values of a multiple R involving only three variables,
a nomograph can be used. Frederic M. Lord, "Nomograph for Computing Multiple
Correlation Coefficients," Journal of the American Statistical Association, Vol. 50
(December 1955), pp. 1073-1077.




and $\beta_2$ (known as beta, or multiple-regression, coefficients) represent the
weights that must be assigned the first and second predictor tests in order
to achieve the most accurate prediction.¹³

In this illustrative problem, in which standard scores are used and both
tests had the same correlation with the criterion, the beta weights would
be equal. However, if variable 1 (8th-grade intelligence test) had
correlated .60 with success in algebra and variable 2 (the arithmetic
computation test) had correlated only .40, the intelligence-test standard
scores would have been given more than twice the weight of the arithmetic
computation scores in the prediction formula, which would be as follows:

$$z_y = .53 z_1 + .24 z_2$$

For a student with a z-score of +1 in IQ and −1 in arithmetic
computation, the predicted z-score for algebra grade would be .29, or a
percentile rank of 62. However, for a student with the reverse pattern (a
z-score of −1 in IQ and +1 in arithmetic computation), the predicted
z-score would be only −.29, or a percentile rank of only 38. The variable
having the higher correlation with the criterion is given the greater weight
in the prediction formula.
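
The beta weights quoted above can be reproduced from the standard two-predictor formulas, $\beta_1 = (r_{y1} - r_{y2} r_{12}) / (1 - r_{12}^2)$ and $\beta_2 = (r_{y2} - r_{y1} r_{12}) / (1 - r_{12}^2)$. The sketch below is illustrative only: the function names are our own, and it assumes that the correlation of .30 between the intelligence and arithmetic computation tests carries over from the earlier illustration.

```python
def beta_weights(r_y1, r_y2, r_12):
    """Standard-score regression weights for two predictors."""
    denom = 1 - r_12**2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2

def predicted_z(betas, z_scores):
    """Weighted sum of predictor z-scores."""
    return sum(b * z for b, z in zip(betas, z_scores))

# IQ correlates .60 and arithmetic computation .40 with algebra grade;
# the two predictors are assumed to correlate .30 with each other.
b1, b2 = beta_weights(0.60, 0.40, 0.30)
weights = (round(b1, 2), round(b2, 2))
print(weights)

# High IQ with low computation, then the reverse pattern.
z_high_iq = predicted_z(weights, (+1, -1))
z_low_iq = predicted_z(weights, (-1, +1))
print(round(z_high_iq, 2), round(z_low_iq, 2))
```

Calling `beta_weights(0.50, 0.50, 0.30)` returns two equal weights, which matches the statement that the betas are equal when both tests correlate equally with the criterion.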

Now that computers are being used to an increasing degree in providing
services to school systems, counselors should have made available to them
students' predicted grade-point averages in different subjects. This type of
information would not only be helpful in guidance but would constitute a
better basis for identifying "underachieving pupils" than the
all-too-common plan of calling in all students with low grades, regardless of
their aptitude for the subject field.
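
One way such a comparison might be organized is sketched below. Everything in it is an assumption made for illustration: the record layout, the student names, and especially the margin of half a grade point, which in practice would need to reflect the standard error of the prediction.

```python
def underachievers(students, margin=0.5):
    """Names of students whose actual GPA falls at least `margin`
    below the GPA predicted from their aptitude-test scores."""
    return [s["name"] for s in students
            if s["predicted_gpa"] - s["actual_gpa"] >= margin]

# Hypothetical records (illustration only).
students = [
    {"name": "Student A", "predicted_gpa": 3.2, "actual_gpa": 2.1},
    {"name": "Student B", "predicted_gpa": 1.8, "actual_gpa": 1.7},
    {"name": "Student C", "predicted_gpa": 2.5, "actual_gpa": 2.6},
]
flagged = underachievers(students)
print(flagged)
```

Student B has the lowest grades of the three but is not flagged, because those grades are about what his aptitude scores would lead one to expect.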

The most accurate, objective basis for replying to questions regarding
the curriculum or the vocation in which a student would do best is provided
by substituting this student's scores in a series of multiple regression
equations. In this way, we combine relevant data, weighted so as to have
optimum predictive validity. The predicted scores we would obtain for each
field could be easily compared. The difficulty is that although we have the
statistical techniques for such prediction, and fairly adequate tests of many
abilities, we do not have adequate data from longitudinal studies on the
relationship of predictor test scores and ultimate criteria of success in
different educational and vocational fields. Hence, for a long time to come,
differential prediction in vocational guidance will involve the subjective
interpretation of test and nontest data, utilizing such techniques as
interpretation of profiles.

13 The formulas for beta coefficients in three-variable problems, and problems
involving more than three variables, are given in standard textbooks on statistics.


550 ADMINISTRATIVE, SUPERVISORY, AND GUIDANCE ASPECTS 


STUDYING PROFILE SIMILARITY The publishers of aptitude test batteries 
(such as the Differential Aptitude Tests) and of interest inventories (such 
as the Kuder Preference Record) have published profiles of mean scores 
on different subtests for a number of occupational groups. The counselor 
using such profiles should ascertain the number of cases on which the occu- 
pational profile is based and whether the sample appears to be a repre- 
sentative one. 

The publishers of a few achievement tests have also prepared profiles 
of means, to be used in student counseling. For example, the publishers 
of the Iowa Tests of Educational Development have prepared profiles of
mean subtest scores for students tested in high school and later graduating
with college majors in 11 different fields. A student's profile can be com- 
pared with profiles for each of the 11 curricula to see which one it most 
resembles. Or, if the student has two or three tentative choices, his profile 
can be checked against profiles of mean subtest scores for each curriculum. 

Profiles cannot be as easily or objectively compared as the scores ob- 
tained from regression equations. Visual inspection, that is, comparing the 
student’s profile with each of several occupational profiles to get a general 
impression of similarity, is not very satisfactory. 

One can compute the D-statistic¹⁴ for each of the fields being
considered, that is, the square root of the sum of all squared differences between
student A's interest or ability scores and the mean scores for each of the
relevant occupational profiles.

In Table 17.2, we have compared Donna's scores on the DAT with the
occupational profiles for nurses and teachers. Donna's standard scores
are very similar to the mean scores for nurses in several tests: NA
(numerical ability), MR (mechanical reasoning), CSA (clerical speed and
accuracy), and both the language usage tests. However, we need data from
other sources before we can conclude whether these similarities will
contribute to her success in nursing. The only sizable differences are Donna's
moderate superiority to the nurses' average in both verbal reasoning and
abstract reasoning. Certainly, there is nothing in this comparison of
profiles that would discourage Donna from choosing nursing if the interest


14 The D-statistic represents the distance between two points: (1) the point
represented by graphing the mean scores for an occupational group and (2) the point
represented by graphing the corresponding scores for an individual. If two predictor
tests are used, these are points in 2-dimensional space; for three variables they are
points in 3-dimensional space, and the like. For an illustration of the use of the
D-statistic, an explanation of its meaning, and a comparison with the linear
discriminant function, which should be used when tests show substantial
intercorrelation, the reader is referred to Jum C. Nunnally, Jr., Tests and Measurements:
Assessment and Prediction (New York: McGraw-Hill Book Company, Inc., 1959),
pp. 129-134.


Using Measurement Data in Guidance 551 


Table 17.2
An Illustration of Computation of the D-Statistic for Subtests of
the Differential Aptitude Testsᵃ


                                     Mean standard         Standard    Comparison        Comparison
                                     scores for            scores      with nurses       with teachers
                                               WOMEN       for
SUBTEST OF DAT                       NURSES    TEACHERS    Donna       DIFF.    DIFF.²   DIFF.    DIFF.²

VR (verbal reasoning)                 .77        .87       1.00        +.23     .0529    +.13     .0169
NA (numerical ability)                .67       1.00        .75        +.08     .0064    -.25     .0625
AR (abstract reasoning)               .60        .87        .92        +.32     .1024    +.05     .0025
SR (space relations)                  .73        .63        .90        +.17     .0289    +.27     .0729
MR (mechanical reasoning)             .35        .55        .20        -.15     .0225    -.35     .1225
CSA (clerical speed and accuracy)     .20        .60        .30        +.10     .0100    -.30     .0900
LU-I (language usage)                 .53        .58        .65        +.12     .0144    +.07     .0049
LU-II (language usage, sentences)     .40        .90        .45        +.05     .0025    -.45     .2025

Average                               .53        .75        .65
Sum of squared differences                                                      .2400             .5747
D-statisticᵇ                                                                    .49               .76


ᵃ The standard scores for the samples of nurses and women teachers were obtained by
translating the mean percentile ranks provided in the following bulletin into z-score equivalents
on the normal curve. Data from "The D.A.T.: A Seven-Year Follow-up," Test Service Bulletin
No. 49 (New York: The Psychological Corporation, 1955), p. 13.
ᵇ The D-statistic is the square root of the sum of the squared differences between the
individual's subtest scores and the corresponding mean scores of the group with which he is being
compared.


inventory results and relevant nontest data seem to support this vocational
choice.

When Donna's data are compared with the profile of mean scores for
women teachers, we note similarity with respect to VR, AR, and LU-I.
As compared with teachers, Donna has relatively low scores in NA, MR, CSA,
and LU-II (language usage, sentences). Donna's NA score is above
the average for students in general, but it would be helpful to know her
percentile rank within the teacher group. Because there are a wide variety
of opportunities within teaching, however, and some require less NA
ability than others, this subtest score would not, in itself, discourage
consideration of teaching. Donna's lower score on MR would seem to offer no cause
for concern. Her low score on CSA might be attributable to an emphasis
on working accurately, which lowers the scores of some conscientious
students on this highly speeded test. Donna's most disconcerting score is
that in Language Usage II. Fortunately, this is an area in which she could
markedly improve her ability if she were highly motivated.

Computation of the D-statistics indicates that Donna's profile resembles
that for nurses more than that for teachers. One realizes from this
example, however, some of the disadvantages of routine use of the D-statistic
for comparisons of profiles. Equal weight is given to differences on all tests,
although some differences are surely more significant than others.
Professional judgment needs to be introduced into the interpretation process, as
we have done in the preceding paragraphs.
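
The computation behind Table 17.2 can be sketched in a few lines. The standard scores are those read from the table; the nurses' LU-II mean is illegible in this copy and is inferred here from Donna's score (.45) and the tabled difference (+.05), which is an assumption. The function name is illustrative only.

```python
from math import sqrt

SUBTESTS = ["VR", "NA", "AR", "SR", "MR", "CSA", "LU-I", "LU-II"]

# Standard scores from Table 17.2, in the order given by SUBTESTS.
DONNA = [1.00, 0.75, 0.92, 0.90, 0.20, 0.30, 0.65, 0.45]
# The last nurses' value (.40) is inferred from Donna's .45 and a tabled
# difference of +.05; it is not legible in the source.
NURSES = [0.77, 0.67, 0.60, 0.73, 0.35, 0.20, 0.53, 0.40]
TEACHERS = [0.87, 1.00, 0.87, 0.63, 0.55, 0.60, 0.58, 0.90]


def d_statistic(individual, group_means):
    """Square root of the sum of squared differences between two profiles."""
    return sqrt(sum((x - m) ** 2 for x, m in zip(individual, group_means)))


d_nurses = d_statistic(DONNA, NURSES)
d_teachers = d_statistic(DONNA, TEACHERS)
print(f"D (Donna vs. nurses)   = {d_nurses:.2f}")
print(f"D (Donna vs. teachers) = {d_teachers:.2f}")
```

Because every subtest enters the sum with the same weight, the single large LU-II difference contributes more than a third of the squared distance from the teachers' profile, which illustrates the equal-weighting limitation noted above.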

If the SD's of scores for each occupational group are given, one could
apply the D-statistic only to those subtests with (1) relatively high mean
scores and (2) relatively small SD's. In this way, we would save work by
eliminating from consideration tests in which most students could meet
occupational requirements and also tests in which members of the
occupation show such a wide spread of scores that differences in this variable may
be assumed not to be highly relevant to success.

Since the D-statistic fails to take into account the relative weight that
should be assigned different factors, it is inferior to the multiple-regression
equations in predicting success from ability test scores, but may prove
superior for factors that fail to show a linear relationship to success, such
as interests or personality characteristics. That is, profile comparison is
especially useful in judging the appropriateness of interests or personality
characteristics for different vocations. Scores on these inventories often
show a curvilinear relationship to success (for example, there may be an
optimum range of score values in certain personality traits, with both
higher and lower scores being less predictive of success). Multiple regression
equations, which assume a linear relationship,¹⁵ cannot be used in such cases.

In all interpretations of profiles for individual students, we must be
careful not to read too much significance into differences that do not exceed
the standard error. The student should review the section on the standard
errors of differences in Chapter 3. One of the best approaches is to
incorporate the concept of measurement error into the profile itself, as in
the profile for the Differential Aptitude Tests shown in Chapter 6.
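
As a rough illustration of this caution, the sketch below tests whether a difference between two subtest scores exceeds its standard error. It assumes the customary formulas (standard error of measurement = SD times the square root of 1 minus the reliability, and standard error of a difference = the square root of the sum of the two squared standard errors of measurement). The reliabilities shown are hypothetical and are not taken from any test manual.

```python
from math import sqrt


def se_measurement(sd: float, reliability: float) -> float:
    """Standard error of measurement for one test."""
    return sd * sqrt(1 - reliability)


def se_difference(sd: float, rel_a: float, rel_b: float) -> float:
    """Standard error of the difference between two scores on the same scale."""
    return sqrt(se_measurement(sd, rel_a) ** 2 + se_measurement(sd, rel_b) ** 2)


def exceeds_standard_error(score_a, score_b, sd, rel_a, rel_b, multiple=1.0):
    """True if the observed difference is larger than `multiple` standard errors."""
    return abs(score_a - score_b) > multiple * se_difference(sd, rel_a, rel_b)


# Hypothetical example: two subtests reported as z-scores (SD = 1) with
# assumed reliabilities of .90 and .85.
print(se_difference(1.0, 0.90, 0.85))                        # about 0.5
print(exceeds_standard_error(0.90, 0.20, 1.0, 0.90, 0.85))   # difference of .70
print(exceeds_standard_error(0.65, 0.45, 1.0, 0.90, 0.85))   # difference of .20
```

A stricter counselor might set `multiple` to 2 so that only differences of about two standard errors are interpreted.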

In interpreting profiles for the Kuder Preference Record and other tests
of the forced-choice type, the counselor needs to keep in mind that the
scores are based on the individual's expression of preferences, not on any
absolute measure of his degree of interest in a field. Hence, everyone's
profile is centered around the median. There are no profiles that average
high or average low; it is impossible for the person with high enthusiasm
for many areas to show an elevated score in all of them.


15 In a linear relationship, the higher the predictor scores, the higher the success
on the criterion.

When we use aptitude-test profiles to help students in vocational
planning, we are concerned not only with the reliability of differences between
subtests in the student's profile. We are also concerned with whether these
differences represent a fairly stable pattern of intraindividual differences.
Unless the pattern has stability over time, a counselor of ninth graders has
little justification for using it in a discussion of post-high-school plans. In a
research study in which ninth-grade differences between DAT subtests
were correlated with twelfth-grade differences, the r's ranged from a low
of .20 (numerical minus abstract) to .74 (mechanical minus spelling). The
reader will recognize that the second pair of tests mentioned would have a
much lower intercorrelation than the first pair. According to the findings
of this study, differences among mechanical, clerical, and the over-all level
of the verbal-language-numerical tests showed sufficient stability over time
to justify interpretation. It appeared doubtful that differences between
verbal, numerical, and abstract scores were predictive of later
differences.¹⁶ Although this study was made of the DAT, it
raises questions about the justifiability of assuming the stability of profile
differences on any test batteries that show fairly high intercorrelations
among the tests.


USE OF CRITICAL OR CUT-OFF SCORES Research studies that would
provide counselors with data regarding the critical minimum scores on
various aptitudes for different vocations are urgently needed. This approach
has been used on the GATB (General Aptitude Test Battery); that is,
critical scores are provided for occupational families (groups of related
occupations) for each of three aptitudes that seem to be most important
for that occupational group. The critical scores, on the average, correspond
to the 33d percentile of workers in the occupations involved. Perhaps the
most efficient approach to differential prediction would be to
use cut-off scores to reject occupations or curricula for which the student
or employee does not reach a minimum score in critical abilities. Then,
for curricula or vocations that still seem feasible, one would use
multiple-regression equations for predicting success from ability test scores.
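
The two-stage procedure just described (screen by cut-off scores, then rank the remaining fields by regression) might be sketched as follows. Every field name, cut-off, and weight below is hypothetical and chosen only for illustration; a cut-off of about -0.44 is the z-score at roughly the 33d percentile of a normal distribution.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    cutoffs: dict  # aptitude -> minimum acceptable z-score (hypothetical)
    weights: dict  # aptitude -> regression (beta) weight (hypothetical)


def passes_cutoffs(scores: dict, field: Field) -> bool:
    """Stage 1: the student must reach every critical minimum score."""
    return all(scores[apt] >= minimum for apt, minimum in field.cutoffs.items())


def predicted_success(scores: dict, field: Field) -> float:
    """Stage 2: weighted sum of z-scores, as in a regression equation."""
    return sum(weight * scores[apt] for apt, weight in field.weights.items())


def rank_feasible_fields(scores: dict, fields: list) -> list:
    """Drop fields that fail the cut-offs, then sort the rest by prediction."""
    feasible = [f for f in fields if passes_cutoffs(scores, f)]
    ranked = [(f.name, predicted_success(scores, f)) for f in feasible]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical student and fields, for illustration only.
STUDENT = {"verbal": 0.5, "numerical": -1.2, "spatial": 0.8}
FIELDS = [
    Field("engineering", {"numerical": -0.44, "spatial": -0.44},
          {"numerical": 0.5, "spatial": 0.3}),
    Field("drafting", {"spatial": -0.44}, {"spatial": 0.5, "verbal": 0.1}),
    Field("sales", {"verbal": -0.44}, {"verbal": 0.4}),
]

for name, z in rank_feasible_fields(STUDENT, FIELDS):
    print(f"{name}: predicted z = {z:+.2f}")
```

In this sketch the first field is rejected at the screening stage because the numerical score falls below the critical minimum, so no regression prediction is computed for it.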


16 Jerome E. Doppelt and George K. Bennett, "A Longitudinal Study of the
Differential Aptitude Tests," Educational and Psychological Measurement, vol. 11
(Summer 1951), pp. 228-237.




COMBINING GROUP AND INDIVIDUAL APPROACHES IN 
HELPING HIGH SCHOOL STUDENTS IN SELF-APPRAISAL 
AND LIFE PLANNING 


Since the interpretation of test data to students affects their self-concepts 
and may produce conflicts, every effort should be made to individualize 
the program of test interpretation. However, it is only realistic to recognize
that extensive programs of aiding students in self-appraisal can usually be 
undertaken only if group approaches are used in at least certain aspects of 
the work. Hence, we will attempt to illustrate how a combination of indi- 
vidual and group approaches might be used in a large-scale program of 
interpreting test data to students who are studying a unit in life planning. 
The remainder of this chapter section is a report of an actual experience 
in the interpretation of test data to an 11th-grade class completing a voca- 
tional-guidance unit under a core-class teacher. The various student ex- 
periences that are described took place at intervals over a period of eight 
weeks. 


Developing General Concepts Basic to Student Self-Appraisal 


Group approaches can be used in preparing students for individual 
conferences regarding their own test data. In this 11th-grade class, a short 
film (“Pups and Puzzles") was shown, emphasizing how tests of abilities, 
personality traits, and the like are used by personnel directors in modern 
industrial plants—as a basis for placing applicants in jobs in which they 
can work most effectively, rather than as hurdles to eliminate applicants. 
On the basis of class discussion of the film and home study of several
pamphlets on self-appraisal,¹⁷ basic principles regarding the values and
limitations of test data were formulated and listed on the board.

All types of test and nontest data that would help in self-appraisal were
suggested and listed on the chalkboard. In turn, the values and limitations
of achievement, aptitude, and interest tests were considered as well as
sources of evidence that might help the student to corroborate or question
his own test data. Significant unmeasured aptitudes and other unmeasured
factors affecting vocational planning were listed and their relative
importance considered.


17 For example, such pamphlets as Discovering Your Real Interests (Chicago:
Science Research Associates, 1961); You and Your Mental Abilities (Chicago:
Science Research Associates, 1959).




Studying the Significance of Test Data for Problems of Vocational Choice 


Copies of the tests students had taken were distributed to remind them 
of their content and organization. Illustrative profiles for hypothetical stu- 
dents (using the mimeographed form reproduced in Figure 17.2) were then 
distributed and described. Questions were encouraged and fully discussed. 
Typical problems (such as the flat or poorly differentiated interest profile, 
discrepancy between interest and aptitude patterns, and the like) were 
illustrated by these profiles. In the discussion, emphasis was placed on the 
results of achievement and interest tests, as well as tests of such special 
aptitudes as clerical, spatial, and mechanical, in which a low score was not 
threatening to students. There was a brief discussion of the significance of 
the VR and NA tests of the DAT as being indicative of readiness for 
college-preparatory courses. It was emphasized that some colleges admit 
only students with high scores on these or similar tests while other colleges 
provide a wide variety of courses for students with different ability patterns. 

The class then discussed the ability and personality requirements of
various occupations (or occupational families). Sources of occupational
information and means of evaluating them¹⁸ were discussed prior to a trip to
the school library and an explanation of its file of pamphlets on
occupations. Students discussed how to organize their self-appraisal and
occupational-information data for comparison in a paper on vocations.¹⁹


Preparing Profiles of Test Data for Individual Students 


One of the most time-consuming tasks involved in a self-appraisal
program is the preparation of individual profiles for each student. In the
group-guidance class described above, each student, under teacher
supervision, (1) wrote in his name and other personal data on three copies of
his test profiles; (2) recorded raw scores on all tests; (3) recorded
percentile ranks on tests for which they were available, that is, achievement
tests and interest inventories but not aptitude tests; and (4) graphed
percentile ranks for achievement tests and interest inventories. The profiles
were then returned to the teacher.


18 Max F. Baer and Edward C. Roeber, Occupational Information: Its Nature
and Use (Chicago: Science Research Associates, 1951).
19 As an aid to students in interpreting test and nontest data about their abilities,
interests, and values and using such data in educational and vocational planning,
Katz has developed a work text for eighth- or ninth-graders, which contains
self-appraisal charts and questionnaires, plus explanations in the teen-ager's own
language. Martin R. Katz, You, Today and Tomorrow (Princeton, N. J.: Educational
Testing Service, 1955). An accompanying Teacher's Guide is also available.




Preparing for Individual Interviews Concerning Self-Appraisal and Vocational Choice


In preparation for each individual conference, the guidance teacher 
studied both the student's cumulative record and the profile of recent test 
data. He was alert to discrepancies between recent and earlier sets of data 
and other questions raised by the cumulated test findings. Converted scores 
for the aptitude tests were added to the profiles for most of the students. 
However, with students for whom such a presentation might be threaten- 
ing, aptitude-test data were discussed on the basis of the guidance teacher's 
interpretation of raw scores whenever he felt it advisable to introduce such 
data into the conference. 

During the individual conferences, the teacher encouraged the student 
to react to his scores and interpret their meaning for him, urged him to 
bring in and relate to the test data other relevant sources of information, 
and tried to correct any misconceptions evidenced in the student's written 
interpretation of the data. 


Analysis of an Illustrative Student Profile of Test Data 


Figure 17.2 presents the profile of test data for Dick Jones, an 11th- 
grade student. At the time that deficiency notices were sent out, Dick had 
received a failure notice in geometry and an unsatisfactory notice (D grade) 
in Spanish. His work in the English-social studies class, which he took with 
his guidance teacher, had been consistently below average; he frequently 
failed to hand in assignments and seemed to daydream and dawdle during 
supervised study periods rather than attacking his work systematically. 

Dick was such a fine-looking young man and came from such a good 
family that Mr. Peters had attributed his low achievement merely to the 
indifference or actual dislike that many boys of his age show toward Eng- 
lish and social studies. A number of times he had seen the boy working 
surreptitiously on geometry when he should have been reading a literature 
assignment; hence, he had tended to group Dick in his own mind with the 
other boys in the class who were planning to be engineers and who showed 
a marked preference for mathematics and science as compared with English 
and social studies. 

Dick's two deficiency notices, however, helped Mr. Peters to realize
that he had no objective basis for his generalizations. Observation of Dick's
behavior in class revealed poor work habits and some resistance to
adult-imposed assignments. Study of his cumulative record revealed that his
family lived in a very high socioeconomic neighborhood and that Dick was
the younger of two children, the other child being a girl five years older
who was totally deaf.


Fig. 17.2 Percentile Ranks and Profile of "Dick Jones" on Three Test
Batteries. (Profile chart not reproduced; the percentile ranks it plots are
listed below.)


CALIFORNIA ACHIEVEMENT TESTS, ADVANCED, FORM A
Reading Vocabulary              60
Reading Comprehension           40
Mathematics Reasoning           30
Mathematics Fundamentals         5

DIFFERENTIAL APTITUDE TESTS, FORM B
Verbal Reasoning                25
Abstract Reasoning              30
Numerical Ability               16
Space Relations                 45
Mechanical Reasoning            63

KUDER PREFERENCE RECORD, VOCATIONAL
Outdoor                         83
Mechanical                      70
Computational                   25
Scientific                      60
Persuasive                      46
Artistic                        88
Literary                        70
Musical                         20
Social Service                  38
Clerical                        30




An interview with the mother confirmed the teacher's hypothesis that the 
parents’ frustration concerning the handicap of the older child plus their 
desire to maintain the socioeconomic status of the family combined to make 
it seem imperative that Dick enter college and prepare for a profession. 
Dick did not question the desirability of this goal, but was having a very 
disheartening experience in trying to realize his parents’ ambitions for him. 
Like many students, Dick felt that he was being exploited by his parents
because of their almost exclusive emphasis on grades; he felt insecure
because he was disappointing them in a crucially important way; he was
building up a concept of himself as either "dumb" or somehow unable
to study and concentrate.

Examination of his achievement-test data indicates a reading-vocabulary 
score somewhat above average (as would be expected from his superior 
home environment) but reading-comprehension and mathematics-reasoning 
scores that represent below-average achievement for his grade level. Espe- 
cially low was his percentile rank of only 5 in mathematics fundamentals 
—a score that seems consistent with his low computational interest on the 
Kuder. Dick had already graphed the achievement- and interest-test data; 
he had noticed that they were consistent with his failing grade in mathe- 
matics. 

In the conference, the guidance teacher revealed that Dick's percentile
rank in numerical ability (DAT) was low and commended him, in view
of this rank, for having earned a C in algebra in the tenth grade and a
B in general mathematics in the ninth grade. These marks showed good
effort in fields that were inherently difficult and uninteresting for him.

Dick soon recognized that his tentative vocational goal of engineering 
was highly inadvisable, especially when the teacher explained that his nu- 
merical ability score would be in the lowest 1 percent as compared with 
engineering-school freshmen. 

Dick's attention was called to the fact that his score in space relations
was average, and that his score in mechanical reasoning exceeded
approximately two thirds of 11th-grade boys. In two subtests of the DAT most
closely related to success in college (verbal reasoning and abstract
reasoning), Dick's scores placed him in the lowest quarter and third, respectively,
of his class.

The need for considering other vocational possibilities was apparent to
Dick. It seemed evident from his family background and the aspirations
of his parents that certain seemingly logical possibilities (such as auto
mechanic) were almost out of the question at the present time.
Examination of the interest-inventory profile revealed high interests in the
mechanical and artistic areas, which, combined with his high aptitude in mechanical
reasoning and high achievement in art courses, suggested industrial design
as a possible vocational choice. The combination of high outdoor and
artistic interests, together with Dick's fine appearance and social
acceptability, suggested the possibilities of scout leadership, provided that he could
obtain the necessary preparation at a college with relatively low entrance
requirements. However, Dick's relatively low interest in social service,
confirmed by his passivity in social relationships, was a contraindication
to this vocational choice.

Another possible vocational choice in a field taught at the local junior 
college was merchandising. This choice seemed feasible in view of Dick's 
high artistic interest and achievement, average persuasive interest, such 
important unmeasured factors as fine personal appearance, social accepta- 
bility, home background, and availability of training, and the fact that the 
local demand for young men far exceeded the supply. His artistic interest
and ability (as evidenced by his achievement in design courses) suggested
that he specialize in window display. Possibilities for rapid promotion of
men to administrative positions and the presence in the community of
several large department stores seemed to support the advisability of this field
as a tentative vocational choice.

Dick seemed greatly relieved to know that he should probably not
attempt a four-year program for a college degree and to contemplate his
above-average abilities in art and mechanical reasoning. He was delighted
to learn of vocational fields in which he might succeed and that would
meet his status needs and those of his family. Several next steps were
indicated: (1) active exploration (through reading, interviews, and the like)
of the vocations suggested above and other new possibilities, (2)
enrollment in such exploratory courses as salesmanship or machine shop, and
(3) a conference with the parents concerning the interpretation of the test
data and such revised plans as Dick would formulate.


SUMMARY STATEMENT 


The guidance functions usually performed by specially trained personnel and
those usually performed by teachers were reviewed. The best division of
responsibilities for guidance varies with the size of the school, the specialized staff
and facilities available, the attitudes and training of its administrative staff
members, and many other factors.

Four contentions of guidance leaders who minimize the importance of test
data in counseling were analyzed: (1) that the influence of unmeasured or
qualitative variables is so great that the measurable characteristics, in
comparison, are relatively insignificant; (2) that existing tests have serious limitations;
(3) that the student is likely to take a passive, dependent role in the conference
regarding test data; and (4) that interpretation of test results to a student may
threaten his concept of self and adversely affect the counseling relationship.

The following principles should be observed in the use and interpretation of
data in guidance: (1) the best available tests for the purpose should be
used; (2) responsibility for the interpretation of test data should be placed
with persons who are competent to handle those responsibilities; (3) test data
should be considered in the context of all other available information; (4) test
results and other evaluation data should be interpreted in terms of probabilities,
rather than certainties; (5) evaluation data should be brought into the
counseling interview as they help in meeting a need and should be presented
objectively and impersonally, with the student being encouraged to make his own
interpretations of their personal meaning for him and to express his reactions;
(6) guidance workers must interpret with special caution data from all tests
on which the examinee can vary his response at will; (7) the counselor's
approach should be conditioned by his realization that the student's interpretation
of test data is sometimes more of an emotional than a rational process.

As an illustration of the use of test data in counseling, two of the
hypothetical problems presented in Chapter 1 were next considered. The value of
expectancy tables in judging the suitability of choices already made or in
choosing the optimum level of work was emphasized. Three different approaches
that are helpful in differential prediction (choosing among alternatives on
the basis of intraindividual differences) were studied: (1) using prediction or
regression equations, (2) studying profile similarity, and (3) using critical or
cut-off scores.

Group and individual approaches can be effectively combined in helping high 
school students to use test data in self-appraisal and life planning. An illustra- 
tive application was described of the combined use of these approaches in inter- 
preting data to an eleventh-grade class. 




DISCUSSION QUESTIONS AND SUGGESTED ACTIVITIES 


1. Give two or more illustrations of counseling situations in which expectancy
tables could be used to advantage.

2. What additional data would be desirable in counseling Mary (the girl
whose DAT aptitude test profile is shown in Chapter 6)? In counseling Dick
(the boy whose test data were discussed in this chapter)?

3. What procedures might a counselor use to minimize (a) the risk that the
counselee would passively accept test data as authoritative and not to be
questioned, and (b) the risk that the student's self-concept might be threatened by
the interpretation of test results?

4. In what types of counseling situations is there need to make differential
predictions?

5. Comment on the advantages and disadvantages of using regression
equations and profile analysis in problems requiring differential prediction.

6. What data would be useful to the counselor in counseling students on the
advisability of electing a shorthand course?


Appendixes


Appendix A
A Selected List of Standardized Tests for the Elementary and Secondary Schools

TABLE I      Achievement Test Batteries
TABLE II     Achievement: Business Education
TABLE III    Achievement: Foreign Language
TABLE IV     Achievement: Health Education
TABLE V      Achievement: Language and Literature
TABLE VI     Achievement: Mathematics
TABLE VII    Achievement: Music
TABLE VIII   Achievement: Reading and Vocabulary
TABLE IX     Achievement: Science
TABLE X      Achievement: Social Studies
TABLE XI     Aptitude: Group Tests of General Mental Ability
TABLE XII    Aptitude: Individual Tests of General Mental Ability
TABLE XIII   Aptitude Test Batteries
TABLE XIV    Aptitude: Art
TABLE XV     Aptitude: Business-Clerical
TABLE XVI    Aptitude: Foreign Language
TABLE XVII   Aptitude: Manual Dexterity and Mechanical Aptitude
TABLE XVIII  Aptitude: Mathematics
TABLE XIX    Aptitude: Music
TABLE XX     Aptitude: Reading Readiness
TABLE XXI    Aptitude: Science
TABLE XXII   Interest Inventories
TABLE XXIII  Personal-Social Adjustment
TABLE XXIV   Work-Study Habits and Skills
TABLE XXV    Miscellaneous


Appendix B
Publishers of Standardized Tests


Appendix C
Methods of Expressing Test Scores (Based on the Normal Curve)


Appendix D
Selected Tables


Appendix E
Glossary of Symbols and Terms Used in Measurement and Evaluation


APPENDIX A
A Selected List of Standardized Tests for Elementary and Secondary Schools


Table I
Achievement Test Batteries


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN.ᵃ FORMS SCORINGᵇ DATESᶜ PUBLISHER REVIEWSᵈ


California Achievement Tests, 1957 edition
(CAT)—E. W. Tiegs and W. W. Clark


Lower Primary 1-2 90-110 2 H 1934-59 CTB 5-2 
Upper Primary 3-4.5 125-145 2 H
Elementary 4-6 145-165 4 M,Q 
Junior High 7-9 170-190 4 M,Q 
Advanced 9-14 160-180 3 M,Q 
California Tests in Social and Related 
Sciences (CTSRS)—G. S. Adams and others 
Elementaryᶠ 4-8 170 2 M,Q 1946-55 CTB 5-4
4-23
Advancedᵍ 9-12 170 2 M,Q 1954-55 5-4
Cooperative General Achievement Tests
Revised Series (GAT)
Test 1—Social Studies 12-13 40 2 M 1937-56 ETS 5-787
4-668
Test 2—Natural Sciences 12-13 40 2 M 5-703
4-595


Table I (Continued)
Achievement Test Batteries


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN.ᵃ FORMS SCORINGᵇ DATESᶜ PUBLISHER REVIEWSᵈ

Test 3—Mathematics 12-13 40 2 M 5-420
4-379
Essential High School Content Battery—
D. P. Harry and W. N. Durostʰ 9-13 200-225 2 M 1950-51 HBW 4-9
Iowa Tests of Basic Skills (ITBS)—
E. F. Lindquist and A. N. Hieronymus
Test A—Arithmetic 3-9 60 2 M,Q 1955-56 HM 5-16
Test L—Language Skills 3-9 67 2 M,Q
Test V & R—Vocabulary and
Reading Comprehension 3-9 72 2 M,Q
Test W—Work-Study Skills 3-9 80 2 M,Q
Iowa Tests of Educational Development
(ITED) 9-13 459 1 M 1942-59 SRA 5-17
(Norms— 4-17
1962) 3-12
Test 1—Understanding of Basic Social
Concepts 55 1 M 5-791
Test 2—General Background in the
Natural Sciences 60 1 M 5-713
Test 3—Correctness and Appropriateness
of Expression 60 1 M 5-197
Test 4—Ability To Do Quantitative
Thinking 65 1 M 5-428
Test 5—Ability To Interpret Reading
Materials in Social Studies 60 1 M 5-685


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN.ᵃ FORMS SCORINGᵇ DATESᶜ PUBLISHER REVIEWSᵈ
Test 6—Ability To Interpret Reading 
Materials in Natural Sciences 60 1 M 5-686 
Test 7—Ability To Interpret Literary 
Materials 50 1 M 5-217 
Test 8—General Vocabulary 22 1 M 5-235 
Test 9—Use of Sources of Information 27 1 M 6-692 
Metropolitan Achievement Test (MAT)— 
R. D. Allen and others! 
Primary 1 1 95-100 3 H 1958-61 HBW 4-1 
3-1 
Primary 2 2 105-115 3 H 
Elementary 3-4 160-175 3 H 
Intermediate 5-6 250-280 3 M 
Advanced 7-9 260-290 3 M 
Sequential Tests of Educational Progress
(STEP)ʲ
Level 4 4-6 455 2 M 1957 ETS 5-24
Level 3 7-9 455 2 M
Level 2 10-12 455 2 M
Level 1 13-14 455 2 M
SRA Achievement Series—
L. Thorpe and othersᵏ
Grades 1-2 1-2 95-125 2 H 1954-59 SRA 5-21
Grades 2-4 2-4 95-125 2 H 
Grades 4-6 4-6 355-445 2 M 
Grades 6-9 6-9 300-375 2 M 
SRA High School Placement Test,! Series AP 8.5-9.5 195 1 Central 1957-61 SRA 5-22 
Scoring 
Only 


Table I (Continued)
Achievement Test Batteries


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN.ᵃ FORMS SCORINGᵇ DATESᶜ PUBLISHER REVIEWSᵈ

Stanford Achievement Test—
T. L. Kelley and others
Primary 1.9-3.5 5 Q 1923-64 HBW
Elementary 3.0-4.9 5 Q
Intermediate 5-6 5 M,Q
Advanced 7-9 5 M,Q


ᵃ Where no time limit is specified, approximate working time is given
when this information is available.

ᵇ H—hand-scored; M—machine-scored; Q—specially devised, quick-scor-
ing devices provided to facilitate hand-scoring.

ᶜ The publication dates represent the range of dates for the various
editions, forms, and accessories making up the test, as listed in Oscar
K. Buros, Tests in Print (Highland Park, N.J.: The Gryphon Press,
1961).

ᵈ The number preceding the dash indicates the number of the Buros
Mental Measurements Yearbook; the number following the dash in-
dicates the test entry. The code "40" is used for the 1940 Yearbook
and "38" for the 1938 Yearbook. Reviews of previous editions of a
test are included only if the most recent edition has not yet been
reviewed in a yearbook. Use of parentheses around an entry number
indicates that no evaluative reviews are included; however, informa-
tion concerning number of forms, testing time, and the like are given,
and, in many cases, references to reviews and articles in professional
journals.

ᵉ Subtests in reading, language, and arithmetic available as separates.

ᶠ Eight scores in Social Studies I, eight scores in Social Studies II, and
seven scores in Related Sciences.

ᵍ Eight scores in American History through the War between the States,
eight scores in American History since the War between the States,
twelve scores in Related Sciences.

ʰ Five scores in mathematics, science, social studies, English, total.

ⁱ Subtests in arithmetic, reading, science, and social studies available
as separates.

ʲ Seven subtests at each level: essay test, writing, listening, reading,
mathematics, science, social studies.

ᵏ Subtests in language arts, arithmetic, reading, and work-study skills
available as separates.

ˡ Five scores in reasoning, reading, arithmetic, language arts, total,
and educational ability.

ᵐ New forms issued annually. Series K prepared for use in Catholic
schools also issued annually.
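
The footnote on the REVIEWS column describes a compact notation: a yearbook number, a dash, an entry number, with "40" and "38" standing for the 1940 and 1938 Yearbooks and parentheses marking entries that carry no evaluative reviews. As a rough illustration only, that notation might be decoded as in the Python sketch below. The names `ReviewRef` and `parse_review_code` are hypothetical, introduced here for the example, and the sketch assumes each code has already been cleanly separated from the rest of its table row.

```python
import re
from typing import NamedTuple


class ReviewRef(NamedTuple):
    """One REVIEWS-column code (hypothetical record, for illustration)."""
    yearbook: str      # which Mental Measurements Yearbook the code points to
    entry: int         # test entry number within that yearbook
    evaluative: bool   # False when the code was printed in parentheses


# Optional "(", yearbook number, dash, entry number, optional ")".
_CODE = re.compile(r"^(\()?(\d+)-(\d+)(\))?$")


def parse_review_code(code: str) -> ReviewRef:
    """Decode a code such as '5-2', '(5-506)', or '40-1267'."""
    match = _CODE.match(code.strip())
    # Reject malformed codes, including unbalanced parentheses.
    if match is None or bool(match.group(1)) != bool(match.group(4)):
        raise ValueError(f"not a review code: {code!r}")
    number = match.group(2)
    if number == "40":
        yearbook = "1940 Yearbook"
    elif number == "38":
        yearbook = "1938 Yearbook"
    else:
        yearbook = f"Mental Measurements Yearbook no. {number}"
    return ReviewRef(yearbook, int(match.group(3)), match.group(1) is None)
```

For example, `parse_review_code("(5-506)")` reports entry 506 of the fifth yearbook with `evaluative` set to `False`, matching the footnote's rule for parenthesized entries.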




Table II
Achievement: Business Education


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
Hiett Simplified Shorthand Test (Gregg) 9-12 50 2 M 1951 KSTC 5-512 
National Business Entrance Testing Program 12-16 
Adults 60, 120 3 ᵃ 1938-60 NBEA 5-515
Bookkeeping Test (5-506) 
3-368 
Business Fundamentals and General 
Information (5-508) 
3-369 
General Office Clerical Test 
(including filing) (5-511) 
3-379 
Machine Calculation Test 5-514 
Stenographic Test 5-522 
Typewriting Test 5-526 
SRA Typing Adaptability Test— 
M. Tydlaska and C. White 10-12 
Adults 45 1 H 1954-56 SRA 5-518 
SRA Typing Skills— 
M. W. Richardson and R. A. Pedersen 9-12 
Adults 15 2 H 1947 SRA (3-388d) 
Turse-Durost Shorthand Achievement Test
(Gregg) 10-12 50 1 H 1941-42 HBW (3-392)


ᵃ Two forms are administered only in certified testing centers on specified
dates. The completed tests are scored centrally and the results re-
ported later to the schools or employers. Certificates of proficiency
are issued to those passing the tests. One form requires 60 minutes
and the second form requires 120 minutes to complete the skills
tests. One additional form of the test requiring 120 minutes is avail-
able to schools for general testing; no scoring service or proficiency
certificates are available for this form.


Table III
Achievement: Foreign Language*


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


FRENCH 


Cooperative French Test, 
J. Greenberg, and others 
Elementary 9-16 40 2 M 1932-41 ETS 3-181 
1-4 semesters (h.s.) 
1-2 semesters (college) 


Advanced 9-16 40 2 M 
Cooperative French Listening 
Comprehension Test—N. Brooks 9-16 30 2 M 1955 ETS 5-265 
LATIN 
Cooperative Latin Test—G. Land 
Elementary 9-16 40 2 M 1932-41 ETS (3-204) 
1—4 semesters (h.s.) 40-1365 
38-1065 
1-2 semesters (college) 
Advanced 9-16 40 2 M 3-204 
38-1064 


SPANISH 


Cooperative Inter-American Tests of
Language Usage, Spanish Edition 8-13 35 2 M 1963 GTA


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


Cooperative Spanish Test— 
J. Greenberg, and others 
Elementary 9-16 40 2 M 1932-40 ETS 40-1374 
1-4 semesters (h.s.) 


1-2 semesters (college) 
Advanced 9-16 40 2 M 1939 40-1373 


Spanish-French-German-English Common 
Concepts Foreign Language Test— 
B. H. Banathy, and others 40 2 M 1964 CTB 


* See Chapter 13 for information concerning a series of tests in French,
Spanish, German, Italian, and Russian, developed as a cooperative
project of the Modern Language Association of America, the Educa-
tional Testing Service, and the United States Office of Education.


Table IV 
Achievement: Health Education 


a = === == == d 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


College Health Knowledge Test, Personal
Health—T. H. Dearborn 13-16 1 H 1950-59 STANF 4-478

Health Inventory for High School Students— 

G. Neher 9-12 40-50 1 M 1942 CTB 3-422 

Health Practice Inventory, Revised— 

E. B. Johns and W. L. Juhnke 10-16 20-30 1 M 1943-52 STANF 5-559 

Kilander Health Knowledge Test 9-13 40 2 M 1936-51 HBW (5-562) 
(40-1503) 




Table V 
Achievement: Language and Literature 


А 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


————————————————————————————————————————————————————————— 


LANGUAGE USAGE AND LANGUAGE SKILLS 


Barrett-Ryan-Schrammel English Test— 


New Edition 9-13 60 2 M,Q 1938-54 HBW 5-176 
40-1267 

California Achievement Test: Language— 

1957 Edition (See Achievement Test Batteries) 5-177 

Clapp-Young English Test 5-12 25 2 Q 1929 HM 3-117 


Cooperative English Test, 1960 
Revision, Lower Level 
English Expression, Part I 


Effectiveness of Expression 9-12 40 1 M 1940-60 ETS 5-179 
English Expression, Part II 

Mechanics of Expression 9-12 40 1 M (5-179) 

4-155 

Reading Comprehension 9-12 40 1 M 5-645 


Cooperative English Test, 1960 
Revision, Higher Level 
English Expression, Part I 


Effectiveness of Expression 13-14 40 1 M 1940-60 ETS 
English Expression, Part II 

Mechanics of Expression 13-14 40 1 M 

Reading Comprehension 13-14 40 1 M 


945 


Table V (Continued)
Achievement: Language and Literature


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS


Cooperative Inter-American Tests of Language
Usage (for use with students of
English or Spanish as a second language)
English 8-13 35 2 M 1963 GTA
Spanish 8-13 35 2 M
Greene-Stapp Language Abilities Test 9-13 120 2 M 1952-54 HBW 5-195
LITERATURE
Center-Durost Literature Acquaintance Test 11-13 40 1 M 1953 HBW 5-210
Cooperative Literary Comprehension and
Appreciation Test 10-16 40 2 M 1935-51 ETS 4-184
3-142


Table VI 
Achievement: Mathematics 


____________________ == == —~ђе 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


o r I Imm 


ARITHMETIC AND GENERAL MATHEMATICS 


California Arithmetic Test (See Achievement Test Batteries) 5-468 

Basic Skills in Arithmetic Test— 

W. L. Wrinkle and others 6-12 40-45 2 H 1945 SRA 3-335 

A Brief Survey of Arithmetic Skills, 

Revised Edition—A. E. Traxler 7-12 20 2 H 1947-53 BM 5-467 

Buswell-John Diagnostic Chart for 

Fundamental Processes in Arithmetic 2-8 20 1 H 1925 BM 4-413 
40-1456 

Cooperative General Achievement Tests: 

Test 3, Mathematics (See Achievement Test Batteries) (5-420) 
(4-379) 

3-316 

Cooperative General Mathematics Test for 

High School Classes 11-13 40 1 M 1933-51 ETS 40-1432 

Cooperative Mathematics Test for Grades 7, 

8, and 9—B. Orshansky and H. V. Price 7-9 80 2 M 1950 ETS 5-421 

4-370 

Davis Test of Functional Competence 

in Mathematics 9-13 80 2 M 1951-52 HBW 5-422 

Madden-Peak Arithmetic Computation Test 7-12 49 2 M 1954—57 HBW 5-478 

New Cooperative Tests in Arithmetic 7-9 40 3 M 1962 ETS


Table VI (Continued) 
Achievement: Mathematics 


__———————-—-—--——-——— ————— ——-— 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


New York Test of Arithmetical Meanings— 
J. W. Wrightstone and others 


Level One 1.9-2.1 60 1 H 1956 HBW 5-480 

Level Two 2.9-3.1 60 1 H 
Number Fact Check Sheet—R. Cochrane 5-8 25 1 M 1946-47 CTB 4-417 
Snader General Mathematics Test 9-13 40 2 M 1951-54 HBW (5-439) 

4-378 
ALGEBRA 
Blyth Second-Year Algebra Test 10-12 45 2 M 1953-54 HBW 5-443 
Cooperative Algebra Test: Elementary 
Algebra Through Quadratics— 
M. M. Martin and others 9-12 40 3 M 1932-51 ETS 4-387 
Cooperative Intermediate Algebra Test: 
Quadratics and Beyond 9-12 40 3 M 1933-51 ETS 4-388 
Lankton First-Year Algebra Test 
(End-of-Year Test) 9-12 40 2 M 1951-54 HBW 5-451 
4-394 

Larson-Greene Unit Tests in 
First-Year Algebra 9-12 8-20 2 H 1947 IOWA 4-395 
New Cooperative Tests in Algebra
Algebra I (Elementary) 9-12 40 2 M 1962 ETS
Algebra II (Intermediate) 10-12 40 2 M 1962
Algebra III (Advanced) 11-12 40 2 M 1964


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
Seattle Algebra Test (First Semester)— 
H. B. Jeffery and others 9-12 40 2 M 1951-54 HBW 5-452 
GEOMETRY 
Cooperative Plane Geometry Test— 
M. P. Martin and others 10-11 40 3 M 1932-51 ETS 4-423 
3-357 
New Cooperativé Test in Geometry 10-12 80 2 M 1962 ETS 
Seattle Plane Geometry Test (First Semester) 
—H. B. Jeffery and others 10-12 45 2 M 1951-54 HBW 5-497 
Shaycoft Plane Geometry Test 11-12 40 2 M 1951-54 HBW (5-498) 
4-433 
MISCELLANEOUS MATHEMATICS 
Cooperative Plane Trigonometry Test— 
J. A. Long and others 11-14 40 2 M 1932-51 ETS (4-438) 
40-1474 
New Cooperative Test in Analytical 
Geometry 12-14 40 2 M 1964 ETS 
New Cooperative Test in Calculus 12-14 40 2 M 1964 ETS 
New Cooperative Test in Trigonometry 11-14 40 2 M 1964 ETS 




Table VII 
Achievement: Music 


WORKING 
GRADE TIME NO. OF PUBLICATION 

NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
Diagnostic Tests of Achievement in Music— 
M. L. Kotick and T. L. Torgerson 4-12 60 2 M 1950 CTB 4-226 
The Farnum Music Notation Test 7-9 10 1 H 1953 PSYCH 5-246 
Kwalwasser-Ruch Test of Musical 
Accomplishment 4-12 50 1 H 1924-27 IOWA 40-1333 
Kwalwasser Test of Music Information
and Appreciation 7-16 40-45 1 H 1927 IOWA 40-1334


Table VIII 
Achievement: Reading and Vocabulary 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
————————— ——————————————À— 
Cooperative English Test: 
Reading Comprehension, 1960 Revision— 
F. B. Davis and others 
Lower Level 9-12 40 2 M 1940-60 ETS (5-645) 
(4-547) 
3-497 
Higher Level 13-14 40 2 M 
Cooperative Inter-American Tests
of Readingᵃ
Primary 1,2, L-3 40 2 H 1959-64 GTA 
Intermediate 4-7 40 2 M 
Advanced 8-13 40 2 M 
Cooperative Vocabulary Test— 
F. B. Davis and others 7-16 30 2 M 1940-53 ETS (4-213) 
3-160 
Davis Reading Test 
Series 1 11-13 40 4 M 1956-58 PSYCH 5-625 
Series 2 8-11 40 4 M 1962 
Diagnostic Reading Tests— 
F. O. Triggs and others 
Survey Section 
Booklet I K-1 1 H CDRT 
Booklet II 2 2 H 
Booklet III 3-4 2 H 


285 


Table VIII (Continued) 
Achievement: Reading and Vocabulary 


———————————————————————D 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
А aaaaaaaaasasaaiiaaaaaaaaaaamamaaaaaaassassasasasslsltll 
Survey Section 
Lower Level 4-8 48 4 M 1952-60 SRA 4-531 
Upper Level 7-13 48 8 M 1947-60 SRA 4-531 
Complete Battery 
Lower Level 4-8 2 M 1952-60 CDRT 4-531 
Upper Level 7-13 2 M 1947-60 CDRT 4-531 
Doren Diagnostic Reading Test of 
Word Recognition Skills 1-6 180 1 H 1956 AGS 5-659 
Durost-Center Word Mastery Test 9-12 60 1 M 1950-52 HBW 5-233 
Durrell Analysis of Reading Difficulty, 
New Edition^ 1-6 30-45 1 H 1937-55 HBW 5-660 
Durrell-Sullivan Reading Capacity and 
Achievement Tests 
Primary Test 2.5-4.5 55-65 1 H 1937-45 HBW 5-661 
4-562 
Intermediate Capacity Test 3-6 30-40 1 H 
Intermediate Achievement Test 3-6 45-55 2 H 
Gates Advanced Primary Reading Tests Е 
Word Recognition 2-3 15 9 H 1926-58 TC-COL (5-630) 
3-484 


Paragraph Reading 2-3 25 3 H 


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
Gates Basic Reading Tests 
GS—Reading to Appreciate General 
Significance 3.5-8 8-10 3 H 1958 TC-COL 5-631 
3-485 
UD—Reading to Understand Precise 
Directions 3.5-8 8-10 3 H 
ND—Reading to Note Details 3.5-8 8-10 3 H 
RV— Reading Vocabulary 3.5-8 20 3 H 
LC—Level of Comprehension 3.5-8 20 3 H 
Gates Primary Reading Tests 
Word Recognition 1-2.5 15 3 H 1926-58 TC-COL (5-632) 
4-563 
3-486 
Sentence Reading 1-2.5 15 3 H 
Paragraph Reading 1-2.5 20 3 H 
Gates Reading Diagnostic Tests, Revisedᵇ 1-8 50 2 H 1926-53 TC-COL 5-662
4-563 
Gates Reading Survey 3.5-10 45-60 3 H,M 1939-58 TC-COL (5-633) 
3-487 
Gilmore Oral Reading Testᵇ 1-8 15-20 2 H 1951-52 HBW 5-671
Gray Oral Reading Testsᵇ 1-12 4 H 1915-63 BM 40-1571
Iowa Silent Reading Tests: New Edition— 
H. A. Greene and others 
Elementary 4-8 49 4 M 1933-56 HBW 3-489 
Advanced 9-13 45 4 M 1927-43 


#95 


Table VIII (Continued) 
Achievement: Reading and Vocabulary 


И 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
Kelley-Greene Reading Comprehension Test 9-13 65-75 2 M 1953-55 HBW 5-636 
Michigan Vocabulary Profile Test 9-16 50 2 M 1937-49 HBW 4-216 
Adults 3-166 
40-1320 
38-1171 
Nelson Reading Test, 1962 Edition 3-9 30 2 M,Q 1962 HM 4-545 
3-492 
SRA Reading Record — 
G. T. Buswell and M. Buswell 6-12 40 1 Q 1947-59 SRA 4-550 
3-502 

ᵃ Available in English and Spanish editions. A new Inter-American
series of tests of reading is being completed, with 1964 as the prob-
able release date.

ᵇ Individually administered.


Table IX
Achievement: Science


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS


BIOLOGICAL SCIENCES 


California Tests in Social and Related 
Sciences, Part III—G. S. Adams and others 
Advanced 


Cooperative Biology Test —P. E. Kambly 


Cooperative Inter-American Tests of Natu- 
ral Sciences (English and Spanish Editions) 


Nelson Biology Test 
New Cooperative Test in Biology 


Survey Test in Biological Science— 
G. S. Adams and others 


GENERAL SCIENCE AND ELEMENTARY SCIENCE 


Cooperative General Science Test— 
P. E. Kambly and C. A. Pearson 


Cooperative Science Test for 
Grades 7, 8, and 9—P. E. Kambly 


New Cooperative Test in General Science 


New Cooperative Test in Advanced 
General Science 


80 
40 


40 
40 
40 


40 


40 


~ 


M,Q 


1954-55 
1933-51 


1959-64 
1951-54 
1964 


1959 


1932-51 


1941-51 


1964 


1964 


ETS 


ETS 
ETS 


ETS 


5-4 
4-601 


5-728 
4-605 


4-623 


4-624 
3-571 


986€ 


Table IX (Continued) 
Achievement: Science 


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS

Read General Science Test 9-10 40 2 M 1951-54 HBW (5-715) 
4-628 

Survey Test in Physical Science— 

G. S. Adams and others 7-10 40 1 M 1959 CTB 

PHYSICS AND CHEMISTRY 

Anderson Chemistry Test 10-13 40 2 M 1951-54 HBW 5-737 
4-613 

Cooperative Chemistry Test— 

P. J. Burke and J. F. Castka 10-12 40 3 M 1933-50 ETS 5-744 

Cooperative Physics Test— 

P. J. Burke and others 10-12 40 2 M 1932-51 ETS 5-751 

Dunning Physics Test 10-13 45 2 M 1951-54 HBW 5-753 
4-636 

New Cooperative Test in Chemistry 11-12 40 2 M 1964 ETS 

New Cooperative Test in Physics 11-12 40 2 M 1964 ETS 

Physical Science Study Committee Testsᵃ 10-12 45 1 M 1959 ETS

Toledo Chemistry Placement Examination—
N. W. Hovey and others 13 55 1 M,Q 1959-63 TOLEDO


ᵃ Ten tests usable only by schools using the PSSC textbook.


Table X 
Achievement: Social Studies 


MEE L1. ——— 0 | | _ __-_ 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


oO 


California Tests in Social and Related 
Sciences, Parts I and II—
G. S. Adams and others 


Elementary 4-8 120 2 M,Q 1946-55 CTB 5-4 
4-23 
Advanced 9-12 90 2 M,Q 1954-55 5-4 
Cooperative American Government Test— 
J. Haefner 10-12 40 2 M 1947-51 ETS 4-702 
Cooperative American History Test— 
H. D. Berg 11-14 40 3 M 1932-51 ETS 4-684 
Cooperative Inter-American Tests of Social 
Studies (English and Spanish Editions) 8-13 40 2 M 1959-64 GTA 
Cooperative Modern European History Test 
—F. H. Stutz 10-14 40 2 M 1932-51 ETS (4-686) 
40-1635 
38-1016 
Cooperative Social Studies Test for
Grades 7, 8, and 9—H. D. Berg and others 7-9 80 2 M 1941-51 ETS 4-663
Cooperative World History Test—
W. Taylor and F. H. Stutz 10-11 40 2 M 1934-51 ETS 5-814
Crary American History Test 9-13 40 2 M 1950-54 HBW 5-816
4-688


Table X (Continued) 
Achievement: Social Studies 


eal 


WORKING 
GRADE TIME МО. OF PUBLICATION 

NAME OF TEST LEVEL IN MIN, FORMS SCORING DATES PUBLISHER REVIEWS 
___ MMM——————————————————— 
Cummings World History Test 9-13 40 2 M 1950-54 HBW (5-817) 

4-689 

Dimond-Pflieger Problems of 

Democracy Test 9-13 40 2 M 1952-54 HBW 5-833 
Engle Psychology Test 11-13 40 2 M 1952-54 HBW 5-582 
Peltier-Durost Civics and Citizenship Test 9-12 55 2 M 1958 HBW (5-840)


Table XI
Aptitude: Group Tests of General Mental Ability


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS


Academic Promise Tests (APT)—
G. K. Bennett and others 


California Short Form Test of Mental Matu- 
rity^ (CTMM-S)—E. T. Sullivan and others 
Preprimary 


Primary 
Elementary 

Junior High School 
Secondary 
Advanced 


California Test of Mental Maturityᶜ—
1957 Edition (CTMM)
Preprimary 


Primary 
Elementary 
Secondary 
Advanced 


Chicago Non-Verbal Examination 


College Qualification Testse— 
G. K. Bennett and others 


6-9 


9-13 
10-16 
Adults 


10-16 
Adults 


Age 6-over? 


11-13 


90 


25 


80 


— а A 


1962 


1938-58 


1956—57 


1936—54 


1955-60 


PSYCH 

CTB 5-313 
4-282 

CTB 5-314 
4-282 


PSYCH 5-316 
40-1387 


PSYCH 5-320 


065 


Table XI (Continued) 
Aptitude: Group Tests of General Mental Ability 


——————————————————— 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN, FORMS SCORING DATES PUBLISHER REVIEWS 


NII 


Cooperative Inter-American Tests of 
General Ability! 


Primary 1,2, L3 35 2 H 1959-64 GTA 
Advanced 4-7 35 2 M 
Intermediate 8-13 35 2 M 
Cooperative School and College Ability 
Tests (SCAT) 
Level 5 4-6 70 2 M 1955-57 ETS 5-322 
Level 4 6-8 70 2 M 
Level 3 8-10 70 2 M 
Level 2 10-12 70 2 M 
Level 1 12-14 70 4 M 
Davis-Eells Test of General Intelligence 
or Problem-Solving Ability 
Primary 1-2 60 1 H 1953 HBW 5-326 
Elementary 3-6 90 1 H 
Harris-Goodenough Test of Psychological
Maturity K-3 5-10 1 H 1926-61 PSYCH (5-335) 
4-292 
Henmon-Nelson Tests of Mental Ability, 
Revised Edition 
Grades 3-6 3-6 30 2 M,Q 1957 HM 5-342
Grades 6-9 6-9 30 2 Q 
Grades 9-12 9-12 30 2 Q 
Grades 13-17 13-17 30 2 


165 


NAME OF TEST 


Kuhlmann-Anderson Intelligence Tests, 
Sixth Edition 
K 


Onumgow» 


Kuhlmann-Anderson Intelligence Tests, 
Seventh Edition 
D 


EF 
G 
H 


The Lorge-Thorndike Intelligence Testsi 


Level 1 
Level 2 
Level 3 
Level 4 
Level 5 


Ohio State University Psychological Test,
Form 21ⁱ—H. A. Toops


GRADE 
LEVEL 


A 


ON льо — 


Adults 


WORKING 
TIME 
IN MIN. 


20-30 


20-30 
20-30 
20-30 
20-30 
20-30 
20-30 
20-30 


45 


45 
45 
45 


20 
20 


1 Verbal-34 
J Nonverbal-27 


120 


NO. OF 
FORMS 


t2 [2 [2 F2. [2 


N 


SCORING 


PUBLICATION 
DATES 


1927-52 


1927-60 


1949-59 


1919-59 


PUBLISHER 


PP 


PP 
BM 


HM 


SRA 


REVIEWS 


5-348 
4-302 


5-350 


5-359 
4-308 


265 


Table XI (Continued) 
Aptitude: Group Tests of General Mental Ability 


——————————————————————ÁÁÉÁáÁ———————————— 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


—————————————————— 


Otis Quick-Scoring Mental Ability Tests, 


New Edition 
Alpha 1.5-4 40 2 Q 1936-54 HBW 5-362 
Alpha-Short Form 1.5-4 25 1 Q 
Beta 4-9 30 4 M, Q 
Gamma 9-16 30 4 M,Q 
Pintner-Cunningham Primary K-2 25 8 H 1923-46 HBW (5-368) 
3-255 
40-1416 
Pintner-Durost Elementary Test 
Scale 1 Picture Content 2-4 45 2 Q 1940-41 HBW (5-368) 
Scale 2 Reading Content 2-4 45 2 Q 3-255 


Pintner General Ability Tests: 
Nonlanguage Series 4-9 50-60 2 1945 HBW 3-254 


Pintner General Ability Test: 
Verbal Series 


Intermediate 5-9 45 2 M 1938—42 HBW 5-329 
Advanced 9-13 55 2 M 
Raven Progressive Matrices—Form 1938 Ages 8-Adults 45 1 H 1938-58 PSYCH (5-370) 
4-314 
3-258 
40-1417 
Form 1947 (colored matrices) Ages 5—11 and 15-30 d H 1947-56 


Defective Adults 


665 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


oe 


Revised Beta Examinationʲ—
D. E. Kellogg and N. W. Morton Ages 16 and over 30 1 H 1931-57 PSYCH (5-375)
3-259
40-1419


SRA Non-Verbal Form—
R. N. McMurry and J. E. King Ages 12 and over 15-20 1 Q 1946-47 SRA 4-318


SRA Primary Mental Abilities, Revisedᵏ—
L. L. Thurstone and T. G. Thurstone


Grades K-2 K-2 50-60 1 H 1946-62 SRA 5-614 
4-716 
Grades 2-4 2-4 50-60 1 H 
Grades 4-6 4-6 50-60 1 M 
Grades 6-9 6-9 45-55 1 M 
Grades 9-12 9-12 45-55 1 M 
SRA Tests of Educational Ability (TEA)—
L. L. Thurstone and T. G. Thurstoneˡ 4-6 60 1 M 1957-58 SRA 5-377
6-9 65 1 M
9-12 45-50 1 M
SRA Tests of General Ability (TOGA)ᵐ
K-2 K-2 35-45 1 H 1957-60 SRA 
2-4 2-4 35-45 1 H 
4–6 4-6 35-45 1 M 
6–9 6–9 35-45 1 M 
9-12 9-12 35-45 1 M 
SRA Verbal Form Ages 12 and over 15-20 2 Q 1946-56 SRA (5-378) 


4-319 


POS 


Table XI (Continued) 
Aptitude: Group Tests of General Mental Ability 


0—4 


МАМЕ ОЕ ТЕЅТ LEVEL 


WORKING 
TIME NO. OF 
IN MIN. FORMS SCORING DATES 


PUBLICATION 


PUBLISHER REVIEWS 


———-—.—--—-—.——— 


Survey of Mental Maturityⁿ—
W. W. Clark and others
Jr. High Level 7-9 2 M 1959 CTB
Advanced 10-12
Adults 2 M
Terman-McNemar Test of Mental Ability 7-12 1 M 1941-42 HBW (4-324)
3-263


ᵃ Seven scores: spatial relationships, logical reasoning, numerical rea-
soning, verbal concepts, language, nonlanguage, total.

ᵇ A restricted form (B) is available for use in scholarship testing or
other special programs.

ᶜ Eight scores: memory, spatial relationships, logical reasoning, nu-
merical reasoning, verbal concepts, language, nonlanguage, total.

ᵈ A group intelligence test designed for children handicapped in their
use of the English language (the deaf, those with reading difficulties,
and the like). Verbal directions usable for age 6 to adult, pantomime
directions for age 8 to adult.

ᵉ Verbal, numerical, and information subtests available as separates.
Information test has three scores: science, social science, total.

ᶠ Available in English and Spanish editions. A new Inter-American
Series of Tests of General Ability is being completed, with 1964 as
a probable release date.

ᵍ Three scores: verbal, quantitative, total.

ʰ Levels 3, 4, and 5 have verbal and nonverbal subtests. Spanish di-
rections are available for nonverbal battery.

ⁱ Four scores: same-opposites, analogies, reading comprehension, total.

ʲ French edition available.

ᵏ Batteries for two lower levels include subtests on verbal meaning,
spatial ability, perception, and quantitative ability. The battery for
grades 4-6 includes these plus reasoning. The batteries for the two
highest levels include verbal meaning, number sense, reasoning, and
space.

ˡ Four scores: language, reasoning, quantitative, total.

ᵐ Spanish edition available.

ⁿ Three scores: language, nonlanguage, total.


Table XII 
Aptitude: Individual Tests of General Mental Ability 
(To be used only by qualified psychologists or by specially trained psychometrists 
working under their supervision) 


$$$ 


WORKING 
GRADE TIME NO. OF SCOR- PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS ING DATES PUBLISHER REVIEWS 


nn. TT 


Arthur Point Scale of Performance Tests, 


Revised, Form II Ages 5 and over 1 H 1933-47 PSYCH 4-335 
40-1379 

Columbia Mental Maturity Scale, Revised? 

—B. B. Burgemeister and others MA 3-10 years 15-20 1 H 1954-59 HBW (5-402) 

Leiter International Performance Scale Ages 2-18 1 H 1936-52 STOELTING (5-408) 
4-349 

Leiter International Performance Scale: Е " 

Arthur Adaptation Ages 2-12 1 H 1952-55 STOELTING . (5-407) 

Nebraska Test of Learning Aptitude— 

M. S. Hiskey^ Ages 4-10 1 H 1941-55 AUTHOR" 5-409 

Peabody Picture Vocabulary Test— 

L. M. Dunn Ages 2½-18 15 2 Q 1959 AGS

Revised Stanford-Binet Scales, Third Edition, 

1960—L. M. Terman and M. Merrill Ages 2 and over 1 H 1916-60 HM 5-413
4-358

Wechsler Adult Intelligence Scale (WAIS) Ages 16 and over 1 H 1939-55 PSYCH 5-414


p а Е 


965 


Table XII (Continued)
Aptitude: Individual Tests of General Mental Ability
(To be used only by qualified psychologists or by specially trained psychometrists
working under their supervision)


WORKING
GRADE TIME NO. OF SCOR- PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS ING DATES PUBLISHER REVIEWS

Wechsler Intelligence Scale for Children
(WISC)ᶜ 1 H 1949 PSYCH 5-416


ᵃ Usable with children with severe motor handicaps.

ᵇ Norms available for deaf and hard-of-hearing children attending
residential state schools for the deaf. Published by the author
(Marshall S. Hiskey, 5640 Baldwin, Lincoln, Neb.).

ᶜ Spanish edition of manual and record forms available.


Table XIII
Aptitude Test Batteries


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS


Differential Aptitude Tests (DAT)—
G. K. Bennett and othersᵃ
Verbal Reasoning
Numerical Ability
Abstract Reasoning
Space Relations
Mechanical Reasoning
Clerical Speed and Accuracy
Language Usage
Flanagan Aptitude Classification Tests
(FACT)ᵇ
General Aptitude Test Battery (GATB)ᶜ


Guilford-Zimmerman Aptitude Survey 
I. Verbal Comprehension 


II. General Reasoning 
III. Numerical Operations 


IV. Perceptual Speed 


8-13 


8-13 
8-13 
8-13 
8-13 
8-13 
8-13 


9-12 
Adults 


9-12 
Adults 


9-16 
Adults 
9-16 
Adults 
9-16 
Adults 
9-16 
Adults 


30 


30 
25 
30 
30 

6 
35 


3 half-day 
sessions 


120-150 
(for group tests) 


25 


35 


~ 


NNNNNN 


1 


такаа ж 


© 


M,Q 


оо = Ж 


1947-63 


1951-60 


1946-59 


1956 


PSYCH 


SRA 


U.S. Empl. Service 


SHERIDAN 


5-608 


5-609 


4-715 




Table XIII (Continued) 
Aptitude Test Batteries 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
V. Spatial Orientation 9-16 10 1 M
Adults 
VI. Spatial Visualization 9-16 30 1 M 
Adults 
VII. Mechanical Knowledge 9-16 50 1 M 
Adults 
Holzinger-Crowder Uni-Factor Testsᵈ 7-12 80-90 1 M 1952-55 HBW 5-610


Multiple Aptitude Tests (MAT),
1959 Edition—D. Segel and E. Raskinᵉ 7-13 175-220 1 M 1955-60 CTB 5-613


ᵃ Spanish edition available.

ᵇ Available as separates: Inspection, Coding, Memory, Precision, Assembly,
Scales, Coordination, Judgment and Comprehension, Arithmetic, Patterns,
Components, Tables, Mechanics, Expression, Reasoning, Ingenuity.

ᶜ Available only through the state employment service. Subtests on:
Name Comparison, Computation, Three-Dimensional Space, Vocabulary,
Tool Matching, Arithmetic Reasoning, Form Matching, Mark Making,
Pegboard (placing and turning), Finger Dexterity Board (assembling,
disassembling).

ᵈ Five scores: verbal, spatial, numerical, reasoning, scholastic aptitude.

ᵉ Three scores in verbal comprehension, three scores in perceptual
speed, three scores in numerical reasoning, four scores in spatial
visualization.



Table XIV 
Aptitude: Art 


Т 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
a 
Graves Design Judgment Test 7-16 20-30 1 M 1948 PSYCH. 4-220 
Adults 
Horn Art Aptitude Inventory 12-16 50 1 H 1939-53 STOELTING 5-242 
Adults 3-171 
Meier Art Tests 
I—Art Judgment 7-16 40-60 1 M 1929-42 IOWA 4-224 
Adults 3-172 
II—Aesthetic Perception 7-16 40-60 1 Q 1963-64 
Adults 
Tests in Fundamental Abilities of Visual Art 
—A. S. Lewerenz 3-12 85 1 H 1927 CTB 40-1329 
Adults 


——E— ____----____-_-_-_-________Н_Н_________-___н___Н-_____________________________________________ 




Table XV 
Aptitude: Business-Clerical 


$$ 


WORKING 
GRADE TIME NO, OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
А 
Е. К. C. Stenographic Aptitude Теѕі— 
W. Н. Deemer 10-12 33 1 H 1944 SRA 3-372 
Adults 
Minnesota Clerical Test*— 
D. M. Andrew and others 8-12 15 1 H 1933-59 PSYCH 5-850 
Adults 3-627 
40-1664 
Psychological Corporation General Clerical 
Test 9-16 43 1 H 1944-50 PSYCH 4-730 
Adults 3-630 
SRA Clerical Aptitudes 9-12 35 1 Q 1947-50 SRA 4-732 
Adults 
Stenographic Aptitude Test —G. К. Bennett 9-16 25 1 H 1939-46 PSYCH 3-390 
Survey of Working Speed and Accuracy— 
F. Ruch 9-16 20 1 H 1943-48 CTB 3-631 
Adults 
Turse Shorthand Aptitude Test 8-12 40 1 H 1937-40 HBW 4-460 
Adults 3-393 
Turse Clerical Aptitudes Test 8-12 28 1 H 1955 HBW 5-855 
Adults 


% А Spanish adaptation is available. 




Table XVI 
Aptitude: Foreign Language 


—_____________________________ aaaaaaaaaalalalauauauauaeaeaeaeaeaualalauauauauauauaeaeaeasasasasasasasese5tl— 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


ен 


Foreign Language Prognosis Test— 

M. Symonds 8-9 44 2 H 1930-59 TC-COL 4-232 
40-1340 

Iowa Placement Examinations: 

Foreign Language Aptitude Series FA-2, 


Revised—G. D. Stoddard and others 12-13 45 3 M 1925-44 IOWA 3-178 
Modern Language Aptitude Test — 
J. B. Carroll and S. M. Sapon 9-16 30-602 1 M 1958-59 PSYCH 

Adults 


_ _———-— ———— 


8 Thirty minutes working time for Short Form, 60 minutes for total test. 




Table XVII 
Aptitude: Manual Dexterity and Mechanical Aptitude 


———————————————————— 


WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS

Bennett Hand-Tool Dexterity Tests H.S. 4-12 1 H 1946 PSYCH 3-659
Adults
Crawford Small Parts Dexterity Testa H.S. 9-25 1 H 1946-56 PSYCH 5-871 
Adults 4-752 
3-667 
Minnesota Rate of Manipulation Test H.S. 10-15 1 H 1931-57 AGS 3-663 
Adults 40-1662 
O'Connor Finger Dexterity Testa 11-12 8-15 1 H 1920-26 STOELTING 40-1659 
Adults 
O'Connor Tweezer Dexterity Testa 11-12 6-10 1 H 1920-28 STOELTING 40-1678
Adults 
O'Rourke Mechanical Aptitude Test H.S. 55 2 H 1926-57 PSYCH (5-882) 
Adults 3-672 
Prognostic Test of Mechanical A bilities— 
J. W. Wrightstone and C. E. O'Toole 7-12 38 1 M 1946-47 CTB 4-761 
Adults 
Purdue Pegboard*—J. Tiffin 9-16 10 1 H 1941-48 SRA 5-873
Adults 4-751
3-666




WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


И 
Revised Minnesota Paper Form Board— 


R. Likert and W. Н. Quasha^ 9-16 20 2 M 1930-48 PSYCH 5-884 
Adults 4-763 
3-677 
40-1673 
SRA Mechanical Aptitudes 9-12 40 1 Q 1950 SRA 4-764 
Adults 
Stromberg Dexterity Testa H.S. 5-10 1 H 1945-51 PSYCH 4-755
Adults 
Mechanical Comprehension Testsc— 
G. K. Bennett and others 
Form AA H.S. boys 25-35 1 M 1940-55 PSYCH (5-889) 
Adults 4-766 
3-683 
Form BB (more difficult) Applicants 25-35 1 M 1941-51 
for Technical 
Training or 
Employment 
Form CC (most difficult) Engineering 25-35 1 M 1949 
Students 
Form W-1 H.S. girls 25-35 1 M 1942-47
Adults

a Individually administered.

b A bilingual edition (with instructions in both French and English)
has been prepared for French-Canadian use. A form with Spanish
instructions is also available.

c Spanish editions of Forms AA and BB available; bilingual English-French
edition of Form AA available for French-Canadian use.




Table XVIII 
Aptitude: Mathematics 


—————————————————————————————À 
WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
California Algebra Aptitude Test— 
N. Keys and M. McCrum 8-12 50 1 H 1940-58 AGS (5-444) 
4-385 
3-320 
Orleans Algebra Prognosis Test, 
Revised Edition 7-9 39 1 H 1928-51 HBW 4-396 
40-1444 
Orleans Geometry Prognosis Test, 
Revised Edition 9-11 39 1 H 1929-51 HBW 4-427 
40-1471 
Survey Test of Algebraic Aptitude—
R. E. Dinkel 7-9 40 1 M 1959 CTB




Table XIX 
Aptitude: Music 


—————————————————————————D 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
Drake Musical Aptitude Tests 3-16 45-60 2 Q 1954-57 SRA 5-245
Adults 3-175 
Musical Aptitude Test— 
H. S. Whistler and L. P. Thorpe 4-10 40 1 M 1950 CTB 5-250 
4-228 
Seashore Measures of Musical Talent, 
Revised Edition 4-16 60-70 1 M 1919-60 PSYCH 5-251 
Adults 4-229 
3-177 
40-1338 
Wing Standardized Tests of
Musical Intelligence 5-16 50-70 1 H 1939-60 Natl. Found.ᵃ 5-254
Adults


ᵃ Distributed by the National Foundation for Educational Research in England and Wales (79 Wimpole St., London W.1, England).




Table XX 


Aptitude: Reading Readiness 


—————— nn SS eee 


WORKING 
GRADE TIME NO. OF PUBLICATION 

NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 

Gates Reading Readiness Tests K-1 50 1 H 1939-42 TC-COL 4-566 

3-516 

Harrison-Stroud Reading Readiness Profiles K-1 79 1 H 1949-56 HM 5-677 

Metropolitan Readiness Tests— 

G. Hildreth and N. L. Griffiths K-1 65-75 2 H 1933-50 HBW 4-570 

Monroe Reading Aptitude Tests K-1 30-40 1 H 1935 HM 3-519 
Murphy-Durrell Diagnostic Reading
Readiness Test K-1 80 1 H 1947-49 HBW 5-679




Table XXI 
Aptitude: Science 


WORKING 
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 

Engineering and Physical Science Aptitude 

Test—B. V. Moore and others 12-16 72 1 M 1951 PSYCH 4-810 
Adults 3-698 

Physical Scientific Aptitude Examination: 

Form S—J. Lapp and others 12-13 60 1 H 1943 IOWA 3-547 

The Purdue Physical Science Aptitude Test 

—H. H. Remmers and N. A. Rosen 9-13 60 2 M 1943-60 PURDUE 


Table XXII 
Interest Inventories (to be used only under the supervision of qualified psychologists 
or counselors with appropriate training) 




WORKING 
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
Kuder Preference Record— 
Occupational-Form D 9-16 20-30 1 Ma 1956-59 SRA 5-862 
Adults 
Kuder Preference Record— 
Vocational-Form C 9-16 40-45 2 M,Q 1934-56 SRA 5-863 
Adults 4-742 
Occupational Interest Inventory (OII)— 
E. A. Lee and L. P. Thorpe 
Intermediate 7-16 30-40 1 M,Q 1943-56 CTB 5-864 
Adults 
Advanced 9-16 30-40 1 M,Q 
Adults 
Picture Interest Inventory—K. P. Weingarten 7-12 30-40 1 M 1958 CTB (5-865) 
Adults 
A Study of Values— 
G. W. Allport and others 13-16 20-30 1 H 1931-60 HM 5-114 
Adults 4-92 
Vocational Interest Analyses—
E. C. Roeber and others 9-12 45 1 M CTB 5-870
Adults 4-746




WORKING 


GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
Vocational Interest Blank for Men, 
Revised Form M—E. K. Strong (SVIB) Ages 17 30-60 1 M 1927-59 CPP 5-868 
and over 4-747 
3-647 
40-1680 
Vocational Interest Blank for Women, 
Revised Form W—E. K. Strong Ages 17 30-60 1 M 1933-59 CPP 5-869 
and over 3-649 


8 Keys available for 48 predominantly masculine occupations. 




Table XXIII 
Personal-Social Adjustment (to be used only under the supervision of qualified psychologists 
or counselors with appropriate training) 


А 


WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 


MM MM —— M ÀÀ—MÓÁ 00 


The Adjustment Inventory—H. M. Bell 


Student Form 9-16 20-30 1 M,Q 1934-39 CPP (5-30)
Adults 4-28 
40-1200 
38-912 
Adult Form 9-16 20-30 1 M,Q 1938-39 CPP 
Adults 
Bell Adjustment Inventory, 
Revised Student Form 9-16 30-40 1 M 1958 CPP 
Bernreuter Personality Inventory 9-16 20-30 1 M 1931-38 CPP 5-95 
Adults 4-77 
40-1239 
Billett-Starr Youth Problems Inventory 
Junior Level 7-9 60-75 1 Q 1961 HBW 
Senior Level 10-12 60-75 1 Q 
California Psychological Inventory (CPI)— 
H. Gough 8-16 45-60 1 M,Q 1956-57 CPP 5-37 
California Test of Personality (CTP)—
L. P. Thorpe and others 
Primary K-3 40-50 2 H 1953 CTB 5-38 
3-26 
Elementary 4-8 40-50 2 M,Q
Intermediate 7-10 40-50 2 M,Q 
Secondary 9-16 40-50 2 M,Q 




WORKING 
GRADE TIME NO. OF PUBLICATION 
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
Edwards Personal Preference Schedule 13-16 45 1 M,Q 1953-59 PSYCH 5-47 
Adults 
Gordon Personal Inventory 9-16 15 1 H 1956 HBW 5-58 
Adults 
Gordon Personal Profile 9-16 15 1 H 1953-54 HBW 5-59
Adults 
Guilford-Zimmerman Temperament Survey 9-16 45 1 M 1949-55 SHERIDAN 5-65 
Adults 4-49 
Kuder Preference Record, Personal— 
Form A 9-16 40-50 2 M, Q 1948-54 SRA 5-80 
Adults 4-65 
Mental Health Analysis— 
L. P. Thorpe and W. C. Clark 
Elementary 4-8 45-50 1 M 1946-59 CTB 3-59 
Intermediate 7-10 45-50 1 M 
Secondary 9-16 45-50 1 M 
Minnesota Counseling Inventory— 
R. F. Berdie and W. L. Layton 8-12 45-50 1 M 1953-57 PSYCH (5-85) 
Minnesota Multiphasic Personality Inventory 
(MMPI)b—S. R. Hathaway and
J. C. McKinley 11-16 30-90 2 M 1943-51 PSYCH 5-86 
Adults 4-71 
3-60 
Mooney Problem Check List, 1950 Rev.c— 
R. L. Mooney and L. V. Gordon 
Form J 7-9 30-50 1 M 1942-50 PSYCH (5-89) 
4-73 
3-67 




Table XXIII (Continued) 
Personal-Social Adjustment (to be used only under the supervision of qualified psychologists 
or counselors with appropriate training) 


a 


WORKING 
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS
Form H 9-12 30-50 1 M 1941-50
Form C 13-16 30-50 1 M 1941-50
Form A Adults 30-50 1 M 1950 
SRA Junior Inventory— 
H. H. Remmers and R. H. Bauernfeind 4—8 40 1 Q 1951-57 SRA 5-104 
4-90 
SRA Survey of Interpersonal Values— 
L. V. Gordon 9-16 15 1 H 1960 SRA 
Adults 
SRA Youth Inventory— 
H. H. Remmers and others 7-12 40 2 M,Q 1949-56 SRA (5-105) 
4-91 
A Study of Values— 
G. W. Allport and others 13-16 20 1 H 1931-60 HM 5-114 
Adults 4-92 
Syracuse Scales of Social Relations— 
E. F. Gardner and G. Thompson 
Grades 5-6 5-6 50-60 1 Q 1958-59 HBW 
Grades 7-9 7-9 50-60 1 Q 
Grades 10-12 10-12 50 1 Q 


a Spanish edition of booklet form available.

b An individually-administered form (printed on cards) and a group
test (printed in a test booklet).

c An adaptation for rural young people (ages 16-30) and an adaptation
for student nurses may be purchased from Publications Office,
Ohio State University.




Table XXIV 
Work-Study Habits and Skills 


nn === 


WORKING 
GRADE TIME NO. OF PUBLICATION 

NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS 
ee 
Brown-Holtzman Survey of Study Habits 
and Attitudes 9-16 25-35 1 M 1953-56 PSYCH 5-688 
California Study Methods Survey— 
H. D. Carter 7-13 35-50 1 M 1958 CTB (5-689) 
Spitzer Study Skills Test 9-13 150 2 M 1954-55 HBW 5-697 
Tyler-Kimber Study Skills Test 9-16 60-90 1 H 1937 CPP 40-1580
38-1166


Table XXV
Miscellaneous

WORKING
GRADE TIME NO. OF PUBLICATION
NAME OF TEST LEVEL IN MIN. FORMS SCORING DATES PUBLISHER REVIEWS

Brown-Carlsen Listening Comprehension Test 2 1953-55 HBW 5-577
Watson-Glaser Critical Thinking Appraisal 9-16 30-45 1 1942-63 HBW 5-700




APPENDIX B

Publishers of Standardized Tests

AGS    American Guidance Service, 720 Washington Ave., S.E., Minneapolis 14, Minn.
BM     Bobbs-Merrill Co., Inc., 4300 W. 62 St., Indianapolis, Ind.
CDRT   Committee on Diagnostic Reading Tests, 419 West 119 St., New York 27, N. Y.
CPP    Consulting Psychologists Press, Inc., 557 College Ave., Palo Alto, Calif.
CTB    California Test Bureau, Del Monte Research Park, Monterey, Calif.
ETS    Educational Testing Service, 20 Nassau St., Princeton, N. J.
GTA    Guidance Testing Associates, 6516 Shirley Ave., Austin 5, Texas
HBW    Harcourt, Brace & World, 757 Third Ave., New York 17, N. Y.
HM     Houghton Mifflin Co., 2 Park St., Boston 7, Mass.
IND    Indiana State High School Testing Service, Purdue University, Lafayette, Ind.
IOWA   Bureau of Educational Research and Service, State University of Iowa, Iowa City, Iowa
IPAT   Institute for Personality and Ability Testing, 1602 Coronado Dr., Champaign, Ill.
KSTC   Bureau of Educational Measurements, Kansas State Teachers College, Emporia, Kans.
MINN   University of Minnesota Press, Minneapolis 14, Minn.
NBEA   National Business Education Assn., 1201 Sixteenth St., N.W., Washington 6, D. C.
PA     Psychometric Affiliates, Box 1625, Chicago 90, Ill.




616 APPENDIXES 


PP         Personnel Press, Inc., 188 Nassau St., Princeton, N. J.
PSYCH      The Psychological Corporation, 304 East 45 St., New York 17, N. Y.
PURDUE     Purdue University Bookstore, 360 State St., West Lafayette, Ind.
SHERIDAN   Sheridan Supply Company, P.O. Box 837, Beverly Hills, Calif.
SRA        Science Research Associates, Inc., 259 East Erie St., Chicago 11, Ill.
STAN       Stanford University Press, Palo Alto, Calif.
STOELTING  C. H. Stoelting Company, 424 North Homan Ave., Chicago 24, Ill.
TC-COL     Bureau of Publications, Teachers College, Columbia University, New York 27, N. Y.
TOLEDO     The Research Foundation, University of Toledo, 2801 West Bancroft St., Toledo 8, Ohio
USES       U.S. Employment Service. Tests available to schools only in cooperation with State Employment Service offices.
VET        Veterans' Testing Service, American Council on Education, 1785 Massachusetts Ave., N.W., Washington 6, D. C.


APPENDIX C 


Methods of Expressing Test Scores 
(Based on the Normal Curve) 


In Chapter 2, the reader was introduced to the normal curve and to the
standard deviation (SD). The symbol σ (sigma) is used in Figure A.1 to
represent the standard deviation. By using the standard deviation as a unit, it is
possible to compare students’ scores on tests of intelligence, various aspects 
of achievement, and other characteristics by ascertaining the position of each 
score or other datum in a normal frequency distribution. 

In Chapter 2 the reader was also introduced to the concept of standard 
score, by means of which test scores are expressed in terms of their deviation 
from the average in standard-deviation units. It was explained that in order 
to avoid the negative numbers and decimal points of the original standard 
scores, several systems of equated scores have been developed, based on the
use of 50, 100, or some other arbitrary number to represent the average, and
10, 20, or some other arbitrary number to represent the standard deviation.
Figure A.1 illustrates the fundamental equivalence of various systems of
equated scores.

In Chapter 6, the deviation IQ was described as a type of normalized
standard score, by which a student's intelligence-test score could be compared
with the scores of other members of his age group: if he were tested at the
age of 10, with those of other 10-year-olds; if he were retested five years later,
with those of other 15-year-olds. In each case, his raw score on the test would
be transformed into a percentile rank within his own age group. Then his
percentile rank within a normal distribution of 10-year-olds and 15-year-olds,
respectively, would be translated into a type of normalized standard score
(known, in this case, as a deviation IQ).
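
The two-step translation described above (raw score to percentile rank within the age group, then percentile rank to a normalized standard score) can be sketched as follows. This is only an illustration under stated assumptions: the function name `deviation_iq` is hypothetical, and the mean of 100 and standard deviation of 15 are taken from the deviation-IQ scale shown in Figure A.1.

```python
from statistics import NormalDist


def deviation_iq(percentile_rank, mean=100.0, sd=15.0):
    """Translate a within-age-group percentile rank (strictly between 0 and 100)
    into a normalized standard score on a scale with the given mean and SD.

    Assumption: mean 100 and SD 15, as on the deviation-IQ line of Figure A.1.
    """
    # Position in a unit normal distribution that has this percentile rank.
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return mean + sd * z


# The same percentile rank gives the same deviation IQ, whether the
# student is compared with 10-year-olds or with 15-year-olds.
for pr in (2.28, 16, 50, 84, 97.72):
    print(pr, round(deviation_iq(pr)))
```

Because the percentile rank is always computed within the student's own age group, the resulting score keeps the same meaning at every age.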

The standard-score equivalent (z) of a student's raw score is obtained by
computing the deviation of that score from the mean and dividing that
deviation by the standard deviation $\left(z = \dfrac{X - M}{SD}\right)$. In other words, the z-score
indicates the difference between a raw score and the mean of the distribution,
expressed in standard-deviation units. A z score of +1 is one standard
deviation above the mean; a z score of -1.5 is one and one-half standard deviations
below the mean.
This same principle is used in setting up all systems of equated scores based




[Figure A.1: the normal curve, with the following scales aligned beneath its baseline]

Per cent of cases under portions of the normal curve:
below -3σ, 0.13%; -3σ to -2σ, 2.14%; -2σ to -1σ, 13.59%; -1σ to 0, 34.13%;
0 to +1σ, 34.13%; +1σ to +2σ, 13.59%; +2σ to +3σ, 2.14%; above +3σ, 0.13%

Standard deviations       -3σ     -2σ     -1σ      0      +1σ     +2σ     +3σ
Cumulative percentages    0.1%    2.3%    15.9%   50.0%   84.1%   97.7%   99.9%
z-scores                  -3.0    -2.0    -1.0     0      +1.0    +2.0    +3.0
T-scores                   20      30      40      50      60      70      80
CEEB scores               200     300     400     500     600     700     800
AGCT scores                40      60      80     100     120     140     160
Wechsler Scales subtests    1       4       7      10      13      16      19
Deviation IQs              55      70      85     100     115     130     145

Percentile equivalents: 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99
Stanines 1 through 9; per cent in stanine: 4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, 4%

Fig. A.1 


Chart Showing the Equivalences of Various Systems of Equated 
or Standard Scores. 


Reproduced with the permission of the publisher from Test Service Bulletin No. 48 (New York: 
The Psychological Corporation, 1955). 


on the normal curve—whether they are percentile ranks, T-scores, AGCT 
(Army General Classification Test) scores, or deviation IQ's. Examine the 
normal curve in Figure A.1. Note that there are no raw scores printed on the 
baseline. Hence, the person setting up a system of standard scores is free to 
use any numerical scale he chooses.¹ If he wishes to use regular standard
scores (z-values), he sets the mean raw score of his test results equal to zero. 


¹ However, the development of so many different kinds of standard scores is
deplored in Technical Recommendations for Psychological Tests and Diagnostic
Techniques (Washington, D. C.: American Psychological Association, 1954). The
committee recommends that the T-score (with a M of 50 and a SD of 10) be used
when a two-digit standard score is desired; and that stanines (with a M of 5 and
a SD of 2) be used when a one-digit standard score is desired.




Thus, if a distribution of scores on a specific test has a mean of 36 and a
standard deviation of 4, one would enter on the baseline of the normal curve
a raw score of 36 at the zero point or mean; a raw score of 40 (36 + 4) at
a baseline position one unit to the right (+1σ); a raw score of 44 (36 + 8)
at a position two units to the right (+2σ); a raw score of 32 (36 - 4) at a
baseline position one unit to the left (-1σ); and the like. Once these entries
have been made, one can read off intermediate standard-score values for all
raw scores; for example, a raw score of 38 would be +.5; a raw score of 41
would be +1.25; a raw score of 34, -.5; and the like.
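
The reading-off of standard-score values in this example can be checked with a few lines of arithmetic. The sketch below uses the mean of 36 and standard deviation of 4 from the text; the function name `z_score` is merely illustrative.

```python
def z_score(raw, mean, sd):
    """Deviation of a raw score from the mean, in standard-deviation units."""
    return (raw - mean) / sd


MEAN, SD = 36, 4  # values from the example in the text

# Baseline entries and intermediate values for the example distribution.
for raw in (32, 34, 36, 38, 40, 41, 44):
    print(raw, z_score(raw, MEAN, SD))
```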

In using the chart to determine the percentile ranks equivalent to certain
raw scores, one makes use of the fact that the total area under the curve
represents the total number of cases (in this case, test scores) in the frequency
distribution. Vertical lines have been drawn through the score scale (the
baseline) at the zero point and at points 1, 2, 3, and 4 standard-deviation units
to the right (above the mean) and to the left (below the mean). These lines
mark off subareas of the total area under the curve; that is, they mark off the
score limits for certain percentages of cases in a normal frequency
distribution. The numbers printed in these subareas indicate the percentages of
students with scores falling within the specified limits. For example, 34.13
percent of all the students in a normally distributed group have scores falling
between the mean (0) and +1σ. The fact that 68.26 percent (or approximately
two thirds) of the cases fall between +1σ and -1σ in a normal
distribution was emphasized in Chapter 2.

Just below the standard-deviation scale on the chart is a row of percentages.
This shows the cumulated percentages to the left or below each of the
positions on the baseline. Thus, starting from the left, one sees that 0.1 percent
of the individuals in a normal distribution have raw scores that would place
them below -3σ; 2.3 percent would be below -2σ; 16 percent, below -1σ;
50 percent, below 0, or the mean; 84 percent, below +1σ; and the like. The
rounded values for these percentages are shown in the next row.
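
The cumulated percentages in this row are simply areas under the normal curve to the left of each baseline position. A minimal sketch, assuming Python's standard `statistics.NormalDist` for the normal curve (the function name `cumulative_percent` is illustrative):

```python
from statistics import NormalDist


def cumulative_percent(z):
    """Per cent of cases in a normal distribution falling below position z
    (z is expressed in standard-deviation units from the mean)."""
    return 100.0 * NormalDist().cdf(z)


# Positions -3σ through +3σ on the baseline of the chart.
for z in range(-3, 4):
    print(f"{z:+d} SD: {cumulative_percent(z):5.1f}%")
```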

The next scale below the chart is for percentile ranks or percentile equiva- 
lents. If the lines for each of these percentile ranks were extended upward, 
through the area of the normal curve, one could visualize the principle, 
stressed on page 30, that percentile ranks near the average (for example,
55 and 65) represent almost identical raw scores, even though they are 10
percentile ranks apart; whereas a similar difference near either extreme of 
the distribution (for example, the difference between percentile ranks of 1 and 
10 or of 90 and 99) represents a much larger difference in raw scores. One 
can readily see from the chart that 10 percent of the area (students) near 
the middle of the distribution includes a smaller baseline distance (and there- 
fore smaller difference in raw scores) than 10 percent of the area (students) 
near either end of the curve. 
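
The unequal spacing of percentile ranks along the baseline can be made concrete by measuring, in standard-deviation units, the distance between two percentile ranks. The sketch below assumes a normal distribution; the function name `baseline_distance` is hypothetical.

```python
from statistics import NormalDist


def baseline_distance(pr_low, pr_high):
    """Distance along the baseline, in SD units, between two percentile ranks
    of a normal distribution (each rank strictly between 0 and 100)."""
    nd = NormalDist()
    return nd.inv_cdf(pr_high / 100.0) - nd.inv_cdf(pr_low / 100.0)


# Ten percentile ranks near the middle versus roughly ten near each extreme.
for low, high in ((55, 65), (1, 10), (90, 99)):
    print(low, high, round(baseline_distance(low, high), 2))
```

For a test with a standard deviation of 4 raw-score points, multiplying each distance by 4 gives the corresponding difference in raw scores.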

In the remainder of the chart, several widely used systems of standard
scores are included. First are the original standard scores, or z-scores. The
numbers in this scale are the same as those on the baseline of the graph
except that the term σ has been omitted. As illustrated above, anyone can
translate test scores into z-scores by setting the mean for the group equal to
0.0 and the standard deviation equal to 1. The formula for z listed on page
21 can be used in translating raw scores into standard-score equivalents.
T-scores (one of the most widely used systems) equate the mean of the
raw-score distribution with 50 and the standard deviation with 10. Thus, a z-score
of 1.5 is equivalent to a T-score of 65 (1½ standard-deviation units of 10
points above an arbitrary mean of 50). The College Entrance Examination
Board avoids the use of decimals by equating the mean raw score on all its
tests with 500 points and the standard deviation with 100 points. Hence, the
experienced counselor thinks of a College Board score of 550 as one-half
standard deviation (of 100 points) above the average (500 points) on the
CEEB basic norms. On the Army General Classification Test, a mean of 100
points and a standard deviation of 20 points are used.
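
Each of these systems is the same linear translation of the z-score with a different arbitrary mean and standard deviation. The sketch below uses the means and standard deviations named in the text; the dictionary and function names are illustrative only.

```python
# (mean, standard deviation) of each equated-score system named in the text.
SCALES = {
    "T": (50, 10),
    "CEEB": (500, 100),
    "AGCT": (100, 20),
}


def from_z(z, scale):
    """Express a z-score on one of the equated-score scales."""
    mean, sd = SCALES[scale]
    return mean + sd * z


def to_z(score, scale):
    """Recover the z-score from a score on one of the equated-score scales."""
    mean, sd = SCALES[scale]
    return (score - mean) / sd


print(from_z(1.5, "T"))      # z of 1.5 expressed as a T-score
print(to_z(550, "CEEB"))     # College Board score of 550 in SD units
print(from_z(to_z(550, "CEEB"), "AGCT"))  # the same position on the AGCT scale
```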

Stanine scores (developed by the Air Force and now widely used) derive 
their name from the fact that they divide the baseline of the normal curve 
into nine groups; hence "standard nines," or stanines. Except for stanines 1 
and 9, at either extreme, these groups are spaced in units of one-half standard 
deviation. The Wechsler Scale values are used only by psychologists administer- 
ing this individual intelligence test. The deviation IQ's on the last line, however, 
are more widely used and should be understood by all teachers. 
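
The percentages of cases in the nine stanines follow from the half-standard-deviation spacing. In the sketch below, the cutting points (-1.75, -1.25, ..., +1.75, so that stanine 5 is centered on the mean) are an assumption consistent with that spacing, and the function names are hypothetical.

```python
from statistics import NormalDist

# Assumed cutting points between stanines, in SD units: -1.75, -1.25, ..., +1.75.
CUTS = [-1.75 + 0.5 * k for k in range(8)]


def stanine(z):
    """Stanine (1 through 9) for a position z on the baseline."""
    return 1 + sum(z >= cut for cut in CUTS)


def stanine_percentages():
    """Rounded per cent of a normal distribution falling in each stanine."""
    nd = NormalDist()
    edges = [0.0] + [nd.cdf(cut) for cut in CUTS] + [1.0]
    return [round(100 * (hi - lo)) for lo, hi in zip(edges, edges[1:])]


print(stanine_percentages())
print([stanine(z) for z in (-2.0, -0.5, 0.0, 0.6, 2.0)])
```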

A large number of interrelationships can be read from this chart. For ex- 
ample, the percentile ranks for any deviation IQ, standard score, T-score, and 
the like can be readily determined. If the baseline is divided into somewhat 
smaller units and a straight-edge rule is constructed, showing the mean and the 
values of +1σ, +2σ, and the like for a specific distribution of test scores,
approximate standard-score equivalents or percentile ranks can be read off
for any raw-score value. In interpreting research studies in which stanine
scores or T-scores are used, the reader can readily translate these measures into
the more familiar standard scores or percentile ranks.


APPENDIX D Selected Tables 


Table A.1
Equivalent Standard Scores and Percentile Scores in a Normal Distribution

  Equivalents for Scores                 Equivalents for Scores
     at or Above Mean                       at or Below Mean

DISTANCE OF X                                             DISTANCE OF X
FROM MEAN                 PERCEN-     PERCEN-             FROM MEAN
IN SD's       T-SCALED    TILE        TILE      T-SCALED  IN SD's
(Z-SCORE)     SCORE       SCORE       SCORE     SCORE     (Z-SCORE)

3.0           80          99.9        0.1       20        -3.0
2.9           79          99.8        0.2       21        -2.9
2.8           78          99.7        0.3       22        -2.8
2.7           77          99.6        0.4       23        -2.7
2.6           76          99.5        0.5       24        -2.6
2.5           75          99.4        0.6       25        -2.5
2.4           74          99.2        0.8       26        -2.4
2.3           73          99          1         27        -2.3
2.2           72          99          1         28        -2.2
2.1           71          98          2         29        -2.1
2.0           70          98          2         30        -2.0
1.9           69          97          3         31        -1.9
1.8           68          96          4         32        -1.8
1.7           67          96          4         33        -1.7
1.6           66          95          5         34        -1.6
1.5           65          93          7         35        -1.5
1.4           64          92          8         36        -1.4
1.3           63          90          10        37        -1.3
1.2           62          88          12        38        -1.2
1.1           61          86          14        39        -1.1
1.0           60          84          16        40        -1.0
0.9           59          82          18        41        -0.9
0.8           58          79          21        42        -0.8
0.7           57          76          24        43        -0.7
0.6           56          73          27        44        -0.6
0.5           55          69          31        45        -0.5
0.4           54          66          34        46        -0.4
0.3           53          62          38        47        -0.3
0.2           52          58          42        48        -0.2
0.1           51          54          46        49        -0.1
0.0           50          50          50        50         0.0




Table A.2

Effect of Lengthening a Test on Its Reliability as Predicted
by the Spearman-Brown Formula*

                 PREDICTED RELIABILITY WHEN
PRESENT          LENGTH OF TEST IS MULTIPLIED BY
RELIABILITY
COEFFICIENT        2         3         4

   .70           .823      .875      .903
   .72           .837      .885      .911
   .74           .851      .895      .919
   .76           .864      .905      .927
   .78           .876      .915      .934
   .80           .888      .923      .941
   .82           .901      .932      .948
   .84           .913      .940      .955
   .86           .925      .948      .960
   .88           .936      .956      .967
   .90           .947      .964      .973
   .92           .958      .972      .979
   .94           .969      .979      .984

* The Spearman-Brown formula is: $r_n = \dfrac{nr}{1 + (n - 1)r}$

The meanings of symbols are as follows: $r$ = known reliability coefficient; $n$ = number of
times the test whose reliability is to be estimated is longer than the one whose reliability is
known; $r_n$ = estimated reliability coefficient for test of increased length.
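
The Spearman-Brown formula given with Table A.2 can be applied directly to any reliability coefficient and any lengthening factor. A minimal sketch follows; the function name `spearman_brown` is illustrative. Computed values may differ from the printed entries by one unit in the third decimal place because of rounding.

```python
def spearman_brown(r, n):
    """Estimated reliability of a test lengthened n times,
    given the known reliability r of the present test."""
    return (n * r) / (1 + (n - 1) * r)


# Reproduce a few rows of the table: known reliability, then n = 2, 3, 4.
for r in (0.70, 0.80, 0.90):
    print(r, [round(spearman_brown(r, n), 3) for n in (2, 3, 4)])
```

With n = 2 the same function gives the usual step-up from a split-half correlation to the reliability of the full-length test.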


Table A.3

Computation of Mean and Standard Deviation
(Data: Scores on State History Test for School District)

SCORE
INTERVAL      f      d       fd      fd²

96-98        11     +8     + 88      704
93-95        21     +7     +147     1029
90-92        30     +6     +180     1080
87-89        25     +5     +125      625
84-86        29     +4     +116      464
81-83        35     +3     +105      315
78-80        40     +2     + 80      160
75-77        40     +1     + 40       40
72-74        48      0    (+881)
69-71        50     -1     - 50       50
66-68        40     -2     - 80      160
63-65        32     -3     - 96      288
60-62        28     -4     -112      448
57-59        26     -5     -130      650
54-56        17     -6     -102      612
51-53        12     -7     - 84      588
48-50         8     -8     - 64      512
45-47         3     -9     - 27      243
42-44         2    -10     - 20      200
39-41         2    -11     - 22      242
36-38         1    -12     - 12      144
         N = 500          (-799)    8554
                           + 82
                          (Σfd)    (Σfd²)

DIRECTIONS FOR COMPUTING MEAN AND SD

1. In each row, multiply the f and d values and enter the products in the
   fd column.
2. In each row, multiply the d and the fd values and enter the products in
   the fd² column.
3. Add the fd column to obtain Σfd. Add the fd² column to obtain Σfd².
4. Find the correction (in intervals):
   $c \text{ (correction)} = \dfrac{\Sigma fd}{N} = \dfrac{82}{500} = .164$
5. Substitute values obtained in formulas for the mean and standard
   deviation.

$M \text{ (Mean)} = AM + (c)(i) = 73 + (.164)(3) = 73 + .492 = 73.492$

$SD = i\sqrt{\dfrac{\Sigma fd^2}{N} - c^2} = 3\sqrt{\dfrac{8554}{500} - (.164)^2} = 3\sqrt{17.1080 - .0269} = 3\sqrt{17.0811} = 3(4.133) = 12.399 = 12.4$

AM is midpoint of the score interval chosen as the arbitrary origin.
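
The directions in Table A.3 can be followed step by step in a short program. This is a sketch under stated assumptions: the function name `short_method` is hypothetical, and the frequencies for the intervals 63-65 and 60-62 (32 and 28) are inferred from the printed totals N = 500, Σfd = +82, and Σfd² = 8554 rather than read directly from the table.

```python
from math import sqrt

# Frequencies from the 96-98 interval down to the 36-38 interval.
# Assumption: the values 32 and 28 (intervals 63-65 and 60-62) are
# inferred from the printed column totals.
FREQS = [11, 21, 30, 25, 29, 35, 40, 40, 48, 50, 40,
         32, 28, 26, 17, 12, 8, 3, 2, 2, 1]
DEVS = list(range(8, -13, -1))  # d runs from +8 down to -12, in intervals


def short_method(freqs, devs, assumed_mean, interval):
    """Mean and SD from a grouped frequency distribution by the short method."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, devs))        # steps 1 and 3
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, devs))   # steps 2 and 3
    c = sum_fd / n                                          # step 4: correction
    mean = assumed_mean + c * interval                      # step 5
    sd = interval * sqrt(sum_fd2 / n - c ** 2)
    return n, sum_fd, sum_fd2, mean, sd


n, sum_fd, sum_fd2, mean, sd = short_method(FREQS, DEVS, assumed_mean=73, interval=3)
print(n, sum_fd, sum_fd2, round(mean, 3), round(sd, 1))
```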


[Scatter diagram: scores on odd-numbered questions (X variable) plotted against scores on even-numbered questions (Y variable)]

Fig. A.2 Computation of a Reliability Coefficient (split-halves
method) by the Use of the Pearson Product-Moment Method of
Correlation.


COMPUTATION OF PEARSON r BY THE FORMULA FOR GROUPED DATA. The
usual method of computing r from a scatter diagram involves the solution of
the following formula:

$$r = \frac{\dfrac{\Sigma x'y'}{N} - c'_x c'_y}{SD'_x \, SD'_y}$$

in which $N$ is the number of cases; $c'_x$ is the correction value (in interval
units) used in computing the mean of x; $c'_y$ is the corresponding value for
variable y; $SD'_x$ is the standard deviation for x (in interval units); $SD'_y$ is the
corresponding value for y; and $x'$ and $y'$ are deviations from AM (assumed
mean) in terms of interval units. The reader has already met most of these
terms and symbols since they were used in Table 2.2 in the computation of
the mean and standard deviation. The new terms ($x'$ and $y'$) are similar to
$z_x$ and $z_y$, except that $x'$ and $y'$ are deviations computed from an assumed
mean (in terms of interval units), rather than deviations from the actual
mean (in terms of SD units).

The reader will recognize that the first term in the numerator is similar to 
the standard-score or z-score formula for r, used in Table 3.3; that is, it is 
the average of the products of paired deviations. The other terms must be 
introduced into this more elaborate formula because we are no longer
working with standard scores. The second term in the numerator ($c'_x c'_y$) is
necessary because we are working with deviations from an assumed mean, rather
than deviations from the actual mean. The denominator ($SD'_x SD'_y$) is
necessary since the $x'$ values and $y'$ values are not in standard-score form.

Since r involves a measure of relationship and its value does not depend 
on the size of the original scores, the corrections and the standard deviations 
are all left in interval units to simplify computation. As in the short methods 
of computing the mean and standard deviation, all scores in a single square 
(cell) of the scatter-diagram are treated as if they fell at the midpoint of the 
interval. 

Application of the formula above involves seven major steps (illustrated
in Figure A.2 and the Computation Guide).


Computation Guide for the Pearson Product-Moment Method of Computing r 


1. Tally each pair of scores in a correlation table or scatter diagram
   (Fig. A.2). Write in each cell the total number of tallies for that cell.
   Add the frequencies (number of tallies) for each row and enter in the
   $f_y$ column. Add the frequencies for each column and enter in the $f_x$
   row. Total the $f_x$ and $f_y$ values to obtain N.

2. Following the short method of computing the mean (shown in Tables 2.2
   and 3.3), compute for both the x and y variables the $\Sigma fd$ value and
   the correction (in intervals). A subscript is used to denote the
   $\Sigma fd$ value and the correction for each variable; for example, the
   $\Sigma fd$ value for x is written $\Sigma fd_x$. Correction for x is
   written $c'_x$.

3. Following the short method for computing the standard deviation (shown
   in Table A.3), compute the $\Sigma fd^2$ value and the standard deviation
   (SD) for each variable, leaving the SD in terms of intervals.

4. Compute the average product moment, $\Sigma x'y' / N$, using the
   following procedures:

   a. The frequency in each cell of the scatter diagram is multiplied by the
      small numeral printed in the cell. This numeral indicates the product
      moment for that cell (a product of $d_x$, representing the distance or
      deviation in intervals from the x axis, and $d_y$, representing the
      distance in intervals from the y axis). In this way, the $x'y'$ product
      for each cell is obtained.

   b. The positive $x'y'$ products and the negative $x'y'$ products are
      totaled for each row. The partial sums for each row are added
      algebraically to obtain $\Sigma x'y'$.

   c. As a cross-check, the positive and negative $x'y'$ products are totaled
      for each column. The partial sums for each column are added
      algebraically to obtain $\Sigma x'y'$ (which should verify the value
      obtained above).

   d. The verified $\Sigma x'y'$ value is divided by N (the number of cases).

5. The product of the corrections (in intervals) for x and y ($c'_x c'_y$) is
   computed and subtracted from $\Sigma x'y' / N$ to obtain the numerator for
   the formula.

6. The product of the two standard deviations (in interval units) is computed
   to obtain the denominator for the formula.

7. The final division of numerator by denominator is made to obtain the value
   of r. If the sign is negative, the relationship is an inverse one.
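
The seven steps of the guide can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original guide: the function name `pearson_r_grouped` and the representation of the scatter diagram as a dictionary mapping each cell's deviations (in interval units from the assumed means) to its frequency are assumptions made for the example.

```python
from math import sqrt


def pearson_r_grouped(cells):
    """Pearson r from a scatter diagram, by the assumed-mean short method.

    `cells` maps (d_x, d_y) -> frequency, where d_x and d_y are the
    deviations of a cell from the assumed means, in interval units.
    (This data layout is an assumption made for the illustration.)
    """
    # Step 1: N is the total of all cell frequencies.
    n = sum(cells.values())

    # Step 2: sum of f*d and the correction (in intervals) for each variable.
    sum_fd_x = sum(f * dx for (dx, _), f in cells.items())
    sum_fd_y = sum(f * dy for (_, dy), f in cells.items())
    c_x = sum_fd_x / n
    c_y = sum_fd_y / n

    # Step 3: sum of f*d^2 and the SD (in intervals) for each variable.
    sum_fd2_x = sum(f * dx * dx for (dx, _), f in cells.items())
    sum_fd2_y = sum(f * dy * dy for (_, dy), f in cells.items())
    sd_x = sqrt(sum_fd2_x / n - c_x ** 2)
    sd_y = sqrt(sum_fd2_y / n - c_y ** 2)

    # Step 4: average product moment, sum of x'y' divided by N.
    mean_product = sum(f * dx * dy for (dx, dy), f in cells.items()) / n

    # Steps 5-7: numerator, denominator, and final division.
    numerator = mean_product - c_x * c_y
    denominator = sd_x * sd_y
    return numerator / denominator


# A small made-up scatter diagram: most cases lie near the diagonal.
example = {(-1, -1): 3, (0, -1): 1, (0, 0): 4, (1, 0): 1, (1, 1): 3}
print(round(pearson_r_grouped(example), 3))
```

Because r does not depend on the choice of assumed mean, shifting every deviation by a constant leaves the result unchanged, which offers a convenient check on hand computation.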


APPENDIX E 


A Glossary of 100 Measurement Terms


by Roger T. Lennon¹


This glossary of technical terms used in educational and psychological measure- 
ment is primarily for persons with limited training in measurement, rather than 
for the specialist. The terms defined are the more common or basic ones such 
as occur in test manuals and simple research reports. In the definitions, niceties 
of usage have sometimes been sacrificed for the sake of brevity and, it is 
hoped, clarity. 

The definitions are based on study of the definitions and usages of the vari- 
ous terms in about a dozen widely used textbooks in educational and psycho- 
logical measurement and statistics, and in both general and specialized 
dictionaries. There is not complete uniformity among writers in the measure- 
ment field with respect to the usage of certain technical terms; in cases of 
varying usage, either these variations are noted or the definition offered is the 
one that the writer judges to represent the “best” usage. 


academic aptitude. The combination of native and acquired abilities that is
needed for school work; likelihood of success in mastering academic work,
as estimated from measures of the necessary abilities. (Also called
scholastic aptitude.)


accomplishment quotient (AQ). The ratio of educational age to mental age;
EA ÷ MA. (Also called achievement quotient.)


achievement age. The age for which a given achievement test score is the real 
or estimated average. (Also called educational age or subject age.) If the 
achievement age corresponding to a score of 36 on a reading test is 10 
years, 7 months (10-7), this means that pupils 10 years, 7 months 
achieve, on the average, a score of 36 on that test. 


achievement test. A test that measures the extent to which a person has
"achieved" something—acquired certain information or mastered certain 
skills, usually as a result of specific instruction. 


1 Published as Test Service Notebook No. 13 (New York: Harcourt, Brace & World, 
Inc.). Compiled with the assistance of Claude F. Bridges, John C. Marriott, Frances E. 
Crook, and Blythe C. Mitchell, Division of Test Research and Service. Reprinted 
with the permission of Harcourt, Brace & World, Inc. 




age equivalent. The age for which a given score is the real or estimated average
score.


age norms. Values representing typical or average performance for persons of 
various age groups. 


age-grade table. A table showing the number or per cent of pupils of various 
ages in each grade; a distribution of the ages of pupils in successive grades. 


alternate-form reliability. The closeness of correspondence, or correlation,
between results on alternate (i.e., equivalent or parallel) forms of a test; thus,
a measure of the extent to which the two forms are consistent or reliable 
in measuring whatever they do measure, assuming that the examinees 
themselves do not change in the abilities measured between the two test- 
ings. (See RELIABILITY, RELIABILITY COEFFICIENT, STANDARD ERROR.) 


aptitude. A combination of abilities and other characteristics, whether native 
or acquired, known or believed to be indicative of an individual's ability 
to learn in some particular area. Thus, "musical aptitude" would refer 
broadly to that combination of physical and mental characteristics, moti- 
vational factors, and conceivably other characteristics, which is conducive 
to acquiring proficiency in the musical field. Some exclude motivational 
factors, including interests, from the concept of "aptitude," but the more 
comprehensive use seems preferable. The layman may think of "aptitude" 
as referring only to some inborn capacity; the term is no longer so re- 
stricted in its psychological or measurement usage. 


arithmetic mean. The sum of a set of scores divided by the number of scores. 
(Commonly called average, mean.) 


average. A general term applied to measures of central tendency. The three 
most widely used averages are the arithmetic mean, the median, and the 
mode. 
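
As a brief illustration of the three averages named above, the following sketch uses Python's standard `statistics` module on a small set of made-up scores (the scores themselves are hypothetical).

```python
from statistics import mean, median, mode

# Hypothetical raw scores for five pupils.
scores = [12, 15, 15, 18, 20]

arithmetic_mean = mean(scores)   # sum of the scores divided by their number
middle_score = median(scores)    # middle score of the ordered set
most_frequent = mode(scores)     # score that occurs most often

print(arithmetic_mean, middle_score, most_frequent)
```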


battery. A group of several tests standardized on the same population, so that
results on the several tests are comparable. Sometimes loosely applied to
any group of tests administered together, even though not standardized on
the same subjects.


ceiling. The upper limit of ability measured by a test.


class analysis chart. A chart, usually prepared in connection with a battery of
achievement tests, that shows the relative performance of members of a
class on the several parts of the battery.


coefficient of correlation (r). A measure of the degree of relationship, or
“going-togetherness,” between two sets of measures for the same group of
individuals. The correlation coefficient most frequently used in test
development and educational research is that known as the Pearson
(Pearsonian) r, so named for Karl Pearson, originator of the method, or as
the product-moment r, to denote the mathematical basis of its calculation.
Unless otherwise specified, “correlation” usually means the product-moment
correlation coefficient, which ranges from .00, denoting complete absence of
relationship, to 1.00, denoting perfect correspondence, and may be either
positive or negative.


completion item. A test question calling for the completion (filling in) of a 
phrase, sentence, etc., from which one or more parts have been omitted. 


correction for guessing. A reduction in score for wrong answers, sometimes 
applied in scoring true-false or multiple-choice questions. Many question 
the validity or usefulness of this device, which is intended to discourage 
guessing and to yield more accurate rankings of examinees in terms of 
their true knowledge. Scores to which such corrections have been applied— 
e.g. rights minus wrongs, or rights minus some fraction of wrongs—are 
often spoken of as "corrected for guessing" or "corrected for chance." 
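
A minimal sketch of such a correction follows. The glossary speaks only of "rights minus some fraction of wrongs"; the particular fraction used here, 1/(k − 1) for items with k choices, is the conventional one and is stated as an assumption, as is the function name `corrected_score`.

```python
def corrected_score(rights, wrongs, n_choices):
    """Score corrected for guessing: rights minus wrongs / (n_choices - 1).

    With two choices (true-false) this reduces to rights minus wrongs.
    Omitted items are simply not counted.
    """
    return rights - wrongs / (n_choices - 1)


# True-false test: 40 right, 10 wrong.
print(corrected_score(40, 10, 2))
# Four-choice test: 40 right, 12 wrong.
print(corrected_score(40, 12, 4))
```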


correlation. Relationship or “going-togetherness” between two scores or 
measures; tendency of one score to vary concomitantly with the other, as 
the tendency of students of high IQ to be above average in reading ability. 
The existence of a strong relationship—i.e., a high correlation—between 
two variables does not necessarily indicate that one has any causal influ- 
ence on the other. (See COEFFICIENT OF CORRELATION.) 


criterion. A standard by which a test may be judged or evaluated; a set of 
scores, ratings, etc., that a test is designed to predict or to correlate with. 
(See VALIDITY.) 


decile. Any one of the nine percentile points (scores) in a distribution that 
divide the distribution into ten equal parts; every tenth percentile. The 
first decile is the 10th percentile, the ninth decile the 90th percentile, etc. 


deviation. The amount by which a score differs from some reference value, 
such as the mean, the norm, or the score on some other test. 


deviation IQ. See INTELLIGENCE QUOTIENT. 


diagnostic test. A test used to “diagnose,” that is, to locate specific areas of 
weakness or strength, and to determine the nature of weaknesses or 
deficiencies; it yields measures of the components or sub-parts of some 
larger body of information or skill. Diagnostic achievement tests are most 
commonly prepared for the skill subjects—reading, arithmetic, spelling. 


difficulty value. The per cent of some specified group, such as students of a 
given age or grade, who answer an item correctly. 


discriminating power. The ability of a test item to differentiate between persons 
possessing much of some trait and those possessing little. 


distractor. Any of the incorrect choices in a multiple-choice or matching item. 


distribution (frequency distribution). A tabulation of scores from high to low, 
or low to high, showing the number of individuals that obtain each score 
or fall in each score interval. 


educational age (EA). See ACHIEVEMENT AGE. 


equivalent form. Any of two or more forms of a test that are closely parallel 
with respect to the nature of the content and the difficulty of the items 
included, and that will yield very similar average scores and measures of 
variability for a given group. 


error of measurement. See STANDARD ERROR. 




extrapolation. In general, any process of estimating values of a function beyond
the range of available data. As applied to test norms, the process of
extending a norm line beyond the limits of actually obtained data, in
order to permit interpretation of extreme scores. This extension may be
done mathematically by fitting a curve to the obtained data or, as is more
common, by less rigorous methods, usually graphic. See Fig. 1. Considerable
judgment on the test maker's part enters into any extrapolation
process, which means that extrapolated norm values are likely to be to
some extent arbitrary.


[Fig. 1. Norm line plotted against Grade Placement (2.0 to 10.0). Points mark
the median scores of pupils tested in the standardization. The figure labels
the interpolated section (between obtained points) and the extrapolated
portions of the line, and illustrates the assignment of a grade equivalent of
3.4 to a score.]


factor. In mental measurement, a hypothetical trait, ability, or component of
ability, that underlies and influences performance on two or more tests,
and hence causes scores on the tests to be correlated. The term “factor”
strictly refers to a theoretical variable, derived by a process of factor
analysis, from a table of intercorrelations among tests; but it is also
commonly used to denote the psychological interpretation given to the
variable, i.e., the mental trait assumed to be represented by the variable, as
verbal ability, numerical ability, etc.



much of the variation in each of the original measures arises from, or is 
associated with, each of the hypothetical factors. Factor analysis has con- 
tributed to our understanding of the organization or components of intel- 
ligence, aptitudes, and personality; and it has pointed the way to the 
development of “purer” tests of the several components. 


forced-choice item. Broadly, any multiple-choice item in which the examinee 
is required to select one or more of the given choices. The term is best 
used to denote a special type of multiple-choice item, in which the options, 
or choices, are (1) of equal "preference value"—i.e., chosen equally
often by a typical group, but (2) of differential discriminating ability—
i.e., such that one of the options discriminates between persons high and
low on the factor that this option measures, while the other options do not. 


frequency distribution. See DISTRIBUTION. 


grade equivalent. The grade level for which a given score is the real or estimated 
average. 


grade norm. The average score obtained by pupils of given grade placement. 
See NORMS, MODAL AGE. 


group test. A test that may be administered to a number of individuals at the 
same time by one examiner. 


individual test. A test that can be administered to only one person at a time. 


intelligence quotient (IQ). Originally, the ratio of a person’s mental age to his
chronological age (MA/CA) or, more precisely, especially for older
persons, the ratio of mental age to the mental age normal for chronological
age (in both cases multiplied by 100 to eliminate the decimal). More
generally, IQ is a measure of brightness that takes into account both score
on an intelligence test and age. A deviation IQ is such a measure of
brightness, based on the difference or deviation between a person's obtained
score and the score that is normal for the person's age.

The following table shows the classification of IQ's offered by Terman
and Merrill for the Stanford-Binet test, indicating the per cent of persons
in a normal population who fall in each classification. This table is roughly
applicable to tests yielding IQ's having standard deviations of about 16
points (not all do). It is important to bear in mind that any such table
is arbitrary, for there are no inflexible lines of demarcation between
“feeble-minded” and "borderline," etc.

                                                 Per cent of
    Classification              IQ               all persons
    Near genius or genius       140 and above        1
    Very superior               130-139              2.5
    Superior                    120-129              8
    Above average               110-119             16
    Normal or average           90-109              45
    Below average               80-89               16
    Dull or borderline          70-79                8
    Feeble-minded: moron,       60-69                2.5
      imbecile, idiot           59 and below         1
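
The two kinds of IQ described in this entry can be sketched as follows. The ratio form comes directly from the definition. The deviation form shown here is only one plausible reading: it assumes a scale with a mean of 100 and a standard deviation of 16 (the figure mentioned above for the Stanford-Binet), and it assumes that the mean and standard deviation of scores for the person's age group are known. The function names are hypothetical.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age_months / chronological_age_months


def deviation_iq(score, age_group_mean, age_group_sd, iq_sd=16):
    """Deviation IQ, ASSUMING a scale with mean 100 and SD `iq_sd`.

    The deviation of the obtained score from the score normal for the
    person's age is expressed in SD units and rescaled.
    """
    return 100 + iq_sd * (score - age_group_mean) / age_group_sd


# A child of 10 years (120 months) with a mental age of 12 years (144 months).
print(ratio_iq(144, 120))
# A score one SD above the mean for the person's age group.
print(deviation_iq(60, 50, 10))
```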




interpolation. In general, any process of estimating intermediate values between
two known points. As applied to test norms, it refers to the procedure used
in assigning interpreted values (e.g., grade or age equivalents) to scores
between the successive average scores actually obtained in the
standardization process. In reading norm tables, it is necessary at times to
interpolate to obtain a norm value for a score between scores given in the
table; e.g., in the table given here, an age value of 12-5 would be assigned,
by interpolation, to a score of 118. See Fig. 1 under EXTRAPOLATION.

                  Age
    Score        Equiv.
     120          12-6
     115          12-4
     110          12-2
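
The example in this entry can be reproduced with simple linear interpolation. The sketch below is illustrative only: it assumes that interpolation is linear between adjacent table entries and that the result is rounded to the nearest month; the names `NORMS` and `age_equivalent` are hypothetical.

```python
# Norm table from the entry above: (score, age equivalent in months).
NORMS = [
    (110, 12 * 12 + 2),  # 12-2
    (115, 12 * 12 + 4),  # 12-4
    (120, 12 * 12 + 6),  # 12-6
]


def age_equivalent(score, norms=NORMS):
    """Age equivalent (years-months) by linear interpolation in the table."""
    for (s_lo, a_lo), (s_hi, a_hi) in zip(norms, norms[1:]):
        if s_lo <= score <= s_hi:
            # Proportional distance between the two tabled scores.
            months = a_lo + (score - s_lo) * (a_hi - a_lo) / (s_hi - s_lo)
            years, rem = divmod(round(months), 12)
            return f"{years}-{rem}"
    # Outside the tabled range the process would be extrapolation instead.
    raise ValueError("score lies outside the range of the norm table")


print(age_equivalent(118))
```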


inventory test. As applied to achievement tests, a test that attempts to cover 
rather thoroughly some relatively small unit of specific instruction or 
training. The purpose of an inventory test, as the name suggests, is more 
in the nature of a "stock-taking" of an individual's knowledge or skill 
than an effort to measure in the usual sense. The term sometimes denotes 
a type of test used to measure achievement status prior to instruction. 


Many personality and interest questionnaires are designated "inventories," 
since they appraise an individual's status in several personal characteristics, 
or his level of interest in a variety of types of activities. 


item. A single question or exercise in a test. 


item analysis. The process of evaluating single test items by any of several 
methods. It usually involves determining the difficulty value and the dis- 
criminating power of the item, and often its correlation with some criterion. 


Kuder-Richardson formula(s). Formulas for estimating the reliability of a test
from information about the individual items in the test, or from the mean
score, standard deviation, and number of items in the test. Because the
Kuder-Richardson formulas permit estimation of reliability from a single
administration of a test, without the labor involved in dividing the test
into halves, their use has become common in test development. The
Kuder-Richardson formulas are not appropriate for estimating the
reliability of speeded tests.
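
The entry does not print the formulas themselves. As an illustration, the sketch below uses the formula commonly known as Kuder-Richardson formula 21, which needs only the mean score, the standard deviation, and the number of items; it assumes items scored 1 or 0 and of roughly equal difficulty. The function name `kr21` is hypothetical.

```python
def kr21(n_items, mean_score, sd):
    """Kuder-Richardson formula 21 estimate of reliability.

    r = (k / (k - 1)) * (1 - M * (k - M) / (k * SD^2)),
    where k is the number of items, M the mean score, and SD the
    standard deviation of total scores.
    """
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (
        1 - mean_score * (n_items - mean_score) / (n_items * variance)
    )


# A 50-item test with a mean of 30 and a standard deviation of 8.
print(round(kr21(50, 30, 8), 3))
```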


machine-scorable (machine-scored) test. A test that may be scored by means
of a machine. Ordinarily, the term refers to a test adapted for scoring on
the International Test Scoring Machine, manufactured by International
Business Machines Corporation. In taking tests that are to be scored on
this machine, the examinee records his answers on separate answer sheets
with a special electrographic pencil. These pencil marks are electrically
conductive, and current flowing through them may be read on a suitably
calibrated dial as a test score. The machine distinguishes, by means of
[...], between right and wrong answers, and can combine [...] in order to
yield total or part scores, weighted scores, or corrected scores.


matching item. A test item calling for the correct association of each entry in
one list with an entry in a second list.


mean. See ARITHMETIC MEAN.



median. The middle score in a distribution; the 50th percentile; the point that 
divides the group into two equal parts. Half of the group of scores fall 
below the median and half above it. 


mental age (MA). The age for which a given score on an intelligence test is 
average or normal. If a score of 55 on an intelligence test corresponds 
to a mental age of 6 years, 10 months, then 55 is presumably the average 
score that would be made by an unselected group of children 6 years, 10 
months of age.


modal age. That age or age range which is most typical or characteristic of 
pupils of specified grade placement.


modal-age norms. Norms based on the performance of pupils of modal age 
for their respective grades, which are thus free of the distorting influence 
of under-age or over-age pupils. 


mode. The score or value that occurs most frequently in a distribution. 


multiple-choice item. A test item in which the examinee's task is to choose the 
correct or best answer from several given answers, or options. 


multiple-response item. A special type of multiple-choice item in which two 
or more of the given choices may be correct. 


N. The symbol commonly used to represent the number of cases in a distribu- 
tion, study, etc. 


normal distribution. A distribution of scores or measures that in graphic form 
has a distinctive bell-shaped appearance. Figure 2 shows such a graph of a 
normal distribution, known as a normal curve or normal probability curve. 
In a normal distribution, scores or measures are distributed symmetrically 
about the mean, with as many cases at various distances above the mean 
as at equal distances below it, and with cases concentrated near the aver- 
age and decreasing in frequency the further one departs from the average, 
according to a precise mathematical equation. The assumption that mental 
and psychological characteristics are distributed normally has been very 
useful in much test development work. 


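[Fig. 2. Normal curve: frequency (vertical axis) plotted against the variable
(horizontal axis), with the mean (M) and the points 1 S.D. below and above it
marked.]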




norm line. A smooth curve drawn through the mean or median scores of
successive age or grade groups, or through percentile points for a single
group. See Fig. 1 under EXTRAPOLATION. 


norms. Statistics that describe the test performance of specified groups, such 
as pupils of various ages or grades in the standardization group for a test. 
Norms are often assumed to be representative of some larger population, 
as of pupils in the country as a whole. Norms are descriptive of average,
typical, or mediocre performance; they are not to be regarded as standards, 
or as desirable levels of attainment. Grade, age, and percentile are the 
most common types of norms. 


objective test. A test in the scoring of which there is no possibility of differ- 
ence of opinion among scorers as to whether responses are to be scored 
right or wrong. It is contrasted with a "subjective" test—e.g., the usual
essay examination to which different scorers may assign different scores,
ratings, or grades.


omnibus test. A test (1) in which items measuring a variety of mental opera- 
tions are all combined into a single sequence rather than being grouped 
together by type of operation, and (2) from which only a single score is 
derived, rather than separate scores for each operation or function. Omni- 
bus tests make for simplicity of administration: one set of directions and 
one over-all time limit usually suffice. Otis Quick-Scoring Mental Ability 
Tests: Beta or Gamma Tests are omnibus-type tests, as distinguished from 
tests such as Terman-McNemar Test of Mental Ability or Pintner General 
Ability Tests: Verbal, in which the items measuring various operations 
are grouped together, each with its own set of directions. 


percentile (P). A point (score) in a distribution below which falls the per cent
of cases indicated by the given percentile. Thus the 15th percentile denotes
the score or point below which 15 per cent of the scores fall. "Percentile"
has nothing to do with the per cent of correct answers an examinee has
on a test.


percentile rank. The per cent of scores in a distribution equal to or lower than 
the score corresponding to the given rank. 
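
A minimal sketch of this definition follows; the function name `percentile_rank` and the sample scores are hypothetical.

```python
def percentile_rank(score, scores):
    """Per cent of scores in the distribution equal to or lower than `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)


# Hypothetical distribution of ten scores.
distribution = [10, 12, 15, 15, 18, 20, 22, 25, 27, 30]
print(percentile_rank(15, distribution))
```

Note that the percentile rank describes standing within the group; it says nothing about the per cent of items answered correctly.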


performance test. As contrasted with paper-and-pencil test, a test requiring
motor or manual response on the examinee’s part, generally but not
always involving manipulation of concrete equipment or materials.
Cornell-Coxe Performance Ability Scale, Arthur Point Scale of Performance
Tests, and Bennett Hand-Tool Dexterity Test are performance tests,
in this sense. “Performance test” is also used in another sense to denote
a test that is actually a work-sample, and in this sense it may include
paper-and-pencil tests, as, for example, a test in accountancy, or in taking
shorthand, or in proofreading, where no materials other than paper and
pencil may be required, but where the test response is identical with the
behavior about which information is desired.


personality test. [...] (e.g., Heston Personal Adjustment Inventory,
Bernreuter Personality Inventory, Bell Adjustment Inventory) which seek to
measure a person's


status on such traits as dominance, sociability, introversion, etc., by 
means of self-descriptive responses to a series of questions; rating scales 
(e.g., Haggerty-Olson-Wickman Behavior Rating Schedules) which call 
for rating, by one's self or another, of the extent to which a subject pos- 
sesses certain characteristics; situation tests in which the individual's 
behavior in simulated life-like situations is observed by one or more 
judges, and evaluated with reference to various personality traits; and 
opinion or attitude inventories (e.g., Allport-Vernon Study of Values). 
Some writers also classify interest inventories as personality tests. 


power test. A test intended to measure level of performance rather than speed 
of response; hence one in which there is either no time limit or a very 
generous one. 


practice effect. The influence of previous experience with a test on a later 
administration of the same test or a similar test; usually, an increase 
in the score on the second testing, attributed to increased familiarity with 
the directions, kinds of questions, etc. Practice effect is greatest when the 
interval between testings is small, when the materials in the two tests are 
very similar, and when the initial test-taking represents a relatively novel 
experience for the subjects. 


probable error. See STANDARD ERROR.


product-moment coefficient. See COEFFICIENT OF CORRELATION. 


profile. A graphic representation of the results on several tests, for either an 
individual or a group, when the results have been expressed in some 
uniform or comparable terms. This method of presentation permits easy 
identification of areas of strength or weakness. 


projective technique (projective method). A method of personality study in 
which the subject responds as he chooses to a series of stimuli such as 
ink-blots, pictures, unfinished sentences, etc. So called because of the 
assumption that under this free-response condition the subject “projects” 
into his responses manifestations of personality characteristics and organ- 
ization that can, by suitable methods, be scored and interpreted to yield 
a description of his basic personality structure. The Rorschach (ink-blot) 
Technique and the Murray Thematic Apperception Test are the most 
commonly used projective methods. 


prognosis (prognostic) test. A test used to predict future success or failure in 
a specific subject or field. 


quartile. One of three points that divide the cases in a distribution into four
equal groups. The lower quartile, or 25th percentile, sets off the lowest
fourth of the group; the middle quartile is the same as the 50th percentile,
or median; and the third quartile, or 75th percentile, marks off the
highest fourth.


r. See COEFFICIENT OF CORRELATION. 


random sample. A sample of the members of a population drawn in such a
way that every member of the population has an equal chance of being
included—that is, drawn in a way that precludes the operation of bias in
selection. The purpose in using a sample thus free of bias is, of course,


that the sample be fairly "representative" of the total population, so that 
sample findings may be generalized to the population. A great advantage 
of random samples is that formulas are available for estimating the 
expected variation of the sample statistics from their true values in the total 
population; in other words, we know how precise an estimate of the popu- 
lation value is given by a random sample of any given size. 


range. The difference between the lowest and highest scores obtained on a test 
by some group. 


raw score. The first quantitative result obtained in scoring a test. Usually the
number of right answers, number right minus some fraction of number 
wrong, time required for performance, number of errors, or similar 
direct, unconverted, uninterpreted measure. 


readiness test. A test that measures the extent to which an individual has
achieved a degree of maturity or acquired certain skills or information
needed for undertaking successfully some new learning activity. Thus a
reading readiness test indicates the extent to which a child has reached
a developmental stage where he may profitably begin a formal instructional
program in reading.


recall item. An item that requires the examinee to supply the correct answer 
from his own memory or recollection, as contrasted with a recognition 
item, in which he need only identify the correct answer. 
e.g., “Columbus discovered America in the year ?”
is a recall item, whereas 
"Columbus discovered America in a 1425 b 1492 c 1520 d 1546" 
is a recognition item. 


recognition item. An item requiring the examinee to recognize or select the
correct answer from among two or more given answers. See RECALL ITEM.


reliability. The extent to which a test is consistent in measuring whatever it
does measure; dependability, stability, relative freedom from errors of
measurement. Reliability is usually estimated by some form of reliability
coefficient or by the standard error of measurement.


reliability coefficient. The coefficient of correlation between two forms of a
test, between scores on repeated administrations of the same test, or
between halves of a test, properly corrected. These three coefficients
measure somewhat different aspects of reliability but all are properly
spoken of as reliability coefficients. See ALTERNATE-FORM RELIABILITY,
SPLIT-HALF COEFFICIENT, TEST-RETEST COEFFICIENT, KUDER-RICHARDSON
FORMULA(S).


representative sample. A sample that corresponds to or matches the population
of which it is a sample with respect to characteristics important for the
purposes under investigation, e.g., in an achievement test norm sample,
proportion of pupils from each state, from various regions, from segregated
and non-segregated schools, etc.


scholastic aptitude. See ACADEMIC APTITUDE.


skewness. The tendency of a distribution to depart from symmetry or balance
around the mean.



sociometry. Measurement of the interpersonal relationships prevailing among 
the members of a group. By means of sociometric devices, e.g., the 
sociogram, an attempt is made to discover the patterns of choice and 
rejection among the individuals making up the group—which ones are
chosen most often as friends or leaders ("stars"), which are rejected by 
others ("isolates"), how the group subdivides into clusters or cliques, etc. 


Spearman-Brown formula. A formula giving the relationship between the 
reliability of a test and its length. The formula permits estimation of the 
reliability of a test lengthened or shortened by any amount, from 
the known reliability of a test of specified length. Its most common 
application is in the estimation of reliability of an entire test from the 
correlation between two halves of the test (split-half reliability). 
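
The entry describes the formula without printing it. In its usual form, the reliability of a test lengthened n times is n·r / (1 + (n − 1)·r), where r is the reliability of the original test. The sketch below illustrates this; the function name `spearman_brown` is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability of a test `length_factor` times as long.

    Use length_factor=2 to step up a split-half correlation to the
    reliability of the whole test; a factor below 1 models shortening.
    """
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )


# Correlation of .60 between two halves, stepped up to full length.
print(round(spearman_brown(0.60, 2), 2))
```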


split-half coefficient. A coefficient of reliability obtained by correlating scores 
on one half of a test with scores on the other half. Generally, but not 
necessarily, the two halves consist of the odd-numbered and the even- 
numbered items. 


standard deviation (S.D.). A measure of the variability or dispersion of a set
of scores. The more the scores cluster around the mean, the smaller the
standard deviation.


standard error (S.E.). An estimate of the magnitude of the "error of measure- 
ment" in a score—that is, the amount by which an obtained score differs 
from a hypothetical true score. The standard error is an amount such that 
in about two-thirds of the cases the obtained score would not differ by 
more than one standard error from the true score. The probable error 
(P.E.) of a score is a similar measure, except that in about half the cases 
the obtained score differs from the true score by not more than one 
probable error. The probable error is equal to about two-thirds of 
the standard error. The larger the probable or the standard error of a 
score, the less reliable the measure. 
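
The relation between the two measures can be illustrated briefly. The sketch assumes that errors of measurement are normally distributed, in which case the probable error is 0.6745 times the standard error (the "about two-thirds" of the entry); the function names and the sample figures are hypothetical.

```python
def probable_error(standard_error):
    """Probable error, ASSUMING normally distributed errors of measurement."""
    return 0.6745 * standard_error


def error_band(obtained_score, error):
    """Range of one error unit on either side of the obtained score."""
    return (obtained_score - error, obtained_score + error)


se = 3.0
# In about two-thirds of cases the obtained score lies within this band
# of the true score.
print(error_band(72, se))
# In about half of the cases the obtained score lies within this narrower band.
print(error_band(72, probable_error(se)))
```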


standard score. A general term referring to any of a variety of "transformed" 
scores, in terms of which raw scores may be expressed for reasons of 
convenience, comparability, ease of interpretation, etc. 


The simplest type of standard score is that which expresses the deviation
of an individual's raw score from the average score of his group in relation
to the standard deviation of the scores of the group. Thus:

$$\text{Standard score } (z) = \frac{\text{raw score } (X) - \text{mean } (M)}{\text{standard deviation (S.D.)}}$$

By multiplying this ratio by a suitable constant and by adding or
subtracting another constant, standard scores having any desired mean and
standard deviation may be obtained. Such standard scores do not affect
the relative standing of the individuals in the group nor change the shape
of the original distribution.


More complicated types of standard scores may yield distributions differ- 
ing in shape from the original distribution; in fact, they are sometimes 
used for precisely this purpose. Normalized standard scores and K-scores 
(as used in Stanford Achievement Test) are examples of this latter group. 
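
The simplest type of standard score, and its conversion to a scale with any desired mean and standard deviation, can be sketched as follows. The sketch assumes the standard deviation is computed with N in the denominator (`statistics.pstdev`); the function names, the sample scores, and the target scale (mean 50, S.D. 10) are chosen only for illustration.

```python
from statistics import mean, pstdev


def z_scores(raw_scores):
    """(raw score - mean) / standard deviation, for each score in the group."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # ASSUMPTION: SD computed with N, not N - 1
    return [(x - m) / sd for x in raw_scores]


def rescale(z_values, new_mean, new_sd):
    """Multiply by one constant and add another to get the desired scale."""
    return [new_mean + new_sd * z for z in z_values]


raw = [40, 50, 60]
z = z_scores(raw)
scaled = rescale(z, new_mean=50, new_sd=10)
print([round(v, 2) for v in z])
print([round(v, 2) for v in scaled])
```

Because the transformation is linear, the order of the individuals and the shape of the distribution are unchanged; only the units differ.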




standardized test (standard test). A systematic sample of performance obtained 
under prescribed conditions, scored according to definite rules, and capable 
of evaluation by reference to normative information. Some writers restrict 
the term to tests having the above properties, whose items have been 
experimentally evaluated, and/or for which evidences of validity and 
reliability are provided. 


stanine. One of the steps in a nine-point scale of normalized standard scores. 
The stanine (short for standard-nine) scale has values from 1 to 9, with 
a mean of 5, and a standard deviation of 2. 
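
A stanine can be assigned from a percentile rank. The sketch below assumes the conventional allocation of 4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of cases to stanines 1 through 9, which this entry does not itself list; the function name and the treatment of ranks falling exactly on a boundary are likewise assumptions.

```python
from bisect import bisect_left
from itertools import accumulate
from math import sqrt

# Assumed conventional percentage of cases in stanines 1 through 9.
STANINE_PERCENTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]

# Cumulative percent of cases at the top of stanines 1 through 8.
CUMULATIVE_PERCENT = list(accumulate(STANINE_PERCENTS))[:-1]


def stanine_from_percentile(percentile_rank: float) -> int:
    """Map a percentile rank (0 to 100) to a stanine (1 to 9)."""
    if not 0 <= percentile_rank <= 100:
        raise ValueError("percentile rank must lie between 0 and 100")
    # A rank exactly on a boundary is placed in the lower stanine (assumption).
    return bisect_left(CUMULATIVE_PERCENT, percentile_rank) + 1


# Mean and S.D. of the scale implied by the assumed percentages.
scale_mean = sum(k * p for k, p in enumerate(STANINE_PERCENTS, start=1)) / 100
scale_sd = sqrt(
    sum(p * (k - scale_mean) ** 2 for k, p in enumerate(STANINE_PERCENTS, start=1)) / 100
)
print(stanine_from_percentile(50), scale_mean, round(scale_sd, 2))
```

With these percentages the scale has a mean of 5 and a standard deviation slightly under 2, which agrees with the rounded values given in the definition.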


stencil key. A scoring key which, when positioned over an examinee's re- 
sponses either in a test booklet or, more commonly, on an answer sheet, 
permits rapid identification and counting of all right answers. Stencil keys 
may be perforated in positions corresponding to positions of right answers, 
so that only right answers show through when the keys are in place; or 
they may be transparent, with positions of right answers identified by 
circles, boxes, etc., printed on the key. 


strip key. A scoring key arranged so that the answers for items on any page 
or in any column of the test appear in a strip or column that may be 
placed alongside the examinee's responses for easy scoring. 


survey test. A test that measures general achievement in a given subject or 
area, usually with the connotation that the test is intended to measure 
group status, rather than to yield precise measures of individuals. 


test-retest coefficient. A type of reliability coefficient obtained by administering 
the same test a second time after a short interval and correlating the two 
sets of scores. 
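
The correlation in question is ordinarily the Pearson product-moment coefficient. A minimal sketch of the computation follows; the two sets of scores are invented for illustration and the function name is hypothetical.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson product-moment correlation between two paired score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of five examinees on two administrations of one test.
first_testing = [40, 45, 50, 55, 60]
second_testing = [42, 44, 52, 53, 61]
print(f"test-retest coefficient: {pearson_r(first_testing, second_testing):.2f}")
```

A coefficient near 1.00 indicates that examinees kept much the same relative standing on the two testings.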


true-false item. A test question or exercise in which the examinee's task is to 
indicate whether a given statement is true or false. 


true score. A score entirely free of errors of measurement. True scores are 
hypothetical values never obtained by testing, which always involves some 
measurement error. A true score is sometimes defined as the average score 
of an infinite series of measurements with the same or exactly equivalent 
tests, assuming no practice effect or change in the examinee during 
the testings. 


validity. The extent to which a test does the job for which it is used. Validity, 
thus defined, has different connotations for various kinds of tests and, 
accordingly, different kinds of validity evidence are appropriate for them. 
For example: 


(1) The validity of an achievement test is the extent to which the 
content of the test represents a balanced and adequate sampling of the 
outcomes (knowledge, skills, etc.) of the course or instructional program 
it is intended to cover (content, face, or curricular validity). It is best 
evidenced by a comparison of the test content with courses of study, 
instructional materials and statements of instructional goals, and by 
critical analysis of the processes required in responding to the items. 

(2) The validity of an aptitude, prognostic, or readiness test is the 
extent to which it accurately indicates future learning success in the area 
for which it is used as a predictor (predictive validity). It is evidenced by 
correlations between test scores and measures of later success. 

(3) The validity of a personality test is the extent to which the test 
yields an accurate description of an individual's personality traits or per- 
sonality organization (status validity). It may be evidenced by agreement 
between test results and other types of evaluation, such as ratings or clinical 
classification, but only to the extent that such criteria are themselves valid. 


The traditional definition of validity as "the extent to which a test measures 
what it is supposed to measure," seems less satisfactory than the above, 
since it fails to emphasize that the validity of a test is always specific to 
the purposes for which the test is used, and that different kinds of evidence 
are appropriate for appraising the validity of various types of tests. 
Validity of a test item refers to the discriminating power of the item: its 
ability to distinguish between persons having much and those having little 
of some characteristic. 


Ability 
developed versus innate, 181 
tests of, 154 
see also Aptitude 
Achievement 
definition of, 154, 182 
interindividual differences in, 459 
intraindividual differences in, 459 
Achievement tests, 437 ff., 496-497 
development of, 428-432 
for elementary and junior high 
school, 440—444, 496 
for high school, 444—450, 496—497 
performance, 402 
results from, 450-454 
selection of, 437 ff. 
uses of, 432-437 
Adams, Georgia Sachs, 419 
Adjustment. See Personal-social ad- 
justment 
Adkins, D. C., 230 
Administration 
of evaluation program, 488, 499— 
504 
problems in school, 9-14 
of tests, 166, 500—503 
Aesthetic Perception Test, 209 
Age 
chronological, 62 
educational, 56 
mental, 56, 184 
reading, 56 
Age scores, 55—56 
Allport, G. W., 241, 297 
Alschuler, Rose H., 316 
American Psychological Association, 
5, 167, 184, 237 


Index 


American Textbook Publishers Insti- 
tute, 61 
Analysis, as educational goal, 365, 
386-388 
Anastasi, Anne, 40, 312 
“Anchor test,” 421, 530, 531 
Anderson, H. A., 423 
Anderson, H. C., 243 
Anderson Chemistry Test, 369 
Anecdotal record, 273, 508 
Annual Review of Psychology, 175 
Anticipated Achievement Grade Place- 
ments (AAGP), 440, 450, 
465, 466, 468, 469 
Application, as educational goal, 365, 
383-386 
Aptitude 
as combination of abilities, 182-183 
definition of, 181 
interests and, 230-232 
learning difficulties and, 476 
measurement of, 181 ff. 
unitary, 182—183 
Aptitude tests, 58, 115, 122, 130, 154, 
495-496 
in art, 209 
batteries, 183, 194—204 
compared with multiscore intel- 
ligence tests, 198—203 
interpretation of, 219—220 
for clerical skills, 208 
development of, 183-186 
for foreign languages, 211 
interpretation of results from, 214— 
222 
for manual dexterity, 205, 208 
for mathematics, 211 


for music, 209—210 
paper and pencil, 205 
performance, 205 
power vs. speed, 200 
predictions resulting from, 222 
prognostic, 210-212, 234 
purpose of, 212-214 
results from, versus achievement test 
results, 450—452 
for shorthand, 211—212 
special, 204-210 
use of results from, 220-221 
Area transformation, 38 
Arithmetic 
diagnostic tests in, 472, 478 
grade norms in, 51 
Army Alpha, 184, 185 
Army Beta, 185 
Arny, Clara Brown, 176 
Art, aptitude in, 209 
Arthur Point Scale of Performance 
Tests, 184 
Athletics, evaluations in, 417 
Attitudes 
definition of, 248 
expressed, 250-251 
inventory of, 250 
manifest, 248—249 
role of teacher in observing, 249 
measurement of, 248-253 
scales for determining, 251—253 
Authoritarianism, measurement of, 137 
Autobiography 
interpretation of, 269 
personal-social adjustment and, 268— 
270 
Average. See Mean 


Batteries, test 
achievement, 440-450 
aptitude, 183 ff. 
Behavior 
abnormal, 260—261 
adaptive, 260 
criterion 
direct versus indirect measure- 
ment of, 150-151 


intermediate, 105—106 
tests as measures of, 103-106 
tests as predictors of, 105—106 
ultimate, 105-106 
evaluation of, 270-276 
influenced by attitudes, 248-250 
normal, 259 
observation of, 263-265, 270-276 
relevant to personal-social adjust- 
ment, 263—264 
traits, 277-278 
Behavior Preference Record, 493 
Bennett, George K., 222 
Bennett Test of Mechanical Compre- 
hension, 204 
Billett-Starr Youth Problems Inven- 
tory, 311 
Binet, Alfred, 184 
Blair, Glenn M., 175 
Bloom, Benjamin S., 363, 394 
Bond, Guy L., 175 
Bonney, M. E., 286 
Brownell, William A., 324, 325 
Brueckner, Leo J., 472 
Buhler, Charlotte, 258 
Buros, Oscar K., 174 
Buros Yearbooks, 174, 204, 370, 437, 
440, 444 


California Achievement Test (CAT), 
51, 52, 54, 62, 169, 437, 
438, 439, 440, 444, 450, 
465, 473 

California F Scale, 137 

California Personality Test (CPT), 
302, 308 

California Psychological Inventory 
(CPI), 303, 309 

California Study Methods Survey, 492 

California Test Bureau, 168 

California Test of Mental Maturity, 
192, 438 

California Tests in Social and Related 
Sciences, 345, 433, 492, 493 

Campbell, Donald T., 137 

Carroll, John B., 211 

Cattell, Raymond B., 297, 298, 305, 
316 


Checklist 
definition of, 407 
problems, 310-311, 498 
self-rating, 492 
teachers', 406, 422, 492 
for test administrators, 503 
Chicago Test of Primary Mental Abili- 
ties, 199, 230 
Clark, Kenneth E., 244 
Clarke, H. Harrison, 176, 419 
Clerical aptitude, 208-209 
Clinical approach, 262, 311-316 
versus psychometric approach, 261— 
262, 295-296 
Coefficient 
of correlation, 74-83 
of equivalence, 86, 96 
of reliability, 83-97 
of validity, 108, 120, 122-123, 145 
College Entrance Examination Board 
(CEEB), 172, 173, 174, 
413, 445 
tests of, 45, 352, 454 
Committee on Gifted Students, 12 
Completion questions, 334-336 
Comprehension, as educational goal, 
365, 378 
extrapolation and, 381—382 
interpretation and, 381 
translation and, 378-380 
Conference 
parent-teacher, 516-520 
pupil-teacher, 265-268 
Constructs, see Traits 
Converted scores, 17, 28-33 
compared with perfect, 18 
methods for obtaining, 20 ff. 
scaled, 56 
types of, 44-49 
use of, 63-65 
see also Age scores, Grade scores, 
Percentile scores, Stanine 
scores 
Cooperative English Tests, 372, 446, 
525 
Cooperative Foreign Language Tests, 
447 


Cooperative French Tests, 372 


Cooperative General Achievement 
Tests, 446 
Cooperative Mathematics Tests, 446 
Cooperative Social Studies Tests, 447 
Cooperative Study in General Educa- 
tion, 432 
Cooperative Test Division, 57, 446 
Corrective instruction, 480—482 
materials for, 481 
program for, 482 
Correlation 
coefficient of, 74-83, 120 
bi-serial, 354-355 
Pearson product-moment method 
of, 77—78, 232, 624-626 
reliability and, 73—74 
Spearman rank-difference method 
of, 75-76 
interpretation of, 79-83 
multiple, 547—548 
prediction and, 179-182, 547-549 
standard error of estimate and, 82 
variance and, 82, 132 
Correlation matrix, 132, 133 
Counseling. See Guidance 
Crary American History Test, 433 
Criteria 
behavior, 103-106, 150-151 
of evaluation objectives, 391—392 
for test selection, 157 ff., 324, 376 
Crites, John O., 175, 200, 208, 232, 
241, 254, 539 
Cronbach, Lee J., 263, 302, 308 
Cultural background 
and achievement test, 450 
and intelligence test, 217-218 
Cumulative-record system 
characteristics of, 509-510 
definition of, 507 
preparation of, 510-511 
purposes of, 508 
use by teachers of, 510, 511—512 


D-statistic, 550—552 

Darley, J. G., 232, 236 

Data, measurement and evaluation 
explanation of, 4 
from achievement test, 450—454 


combining and weighting of, 522— 
523 
importance of using local, 123 
interpretation of, 17 ff., 538-543 
use of expectancy tables in, 121, 
122, 546 
see also Converted scores, Relia- 
bility, Validity 
on personal-social adjustment, 263 
ff. 
relevancy of, 3, 4, 6 
reporting of, 512-521 
effectiveness of, 519—521 
functions of, 512-514 
types of, 514—519 
sociometric. See Sociometric data 
student reaction to, 540 
summarizing and recording of, 507— 
512 
use of, in guidance, 534, 541 ff. 
Davis, Allison, 217 
Davis Reading Test, 162 
Davis-Eells intelligence tests, 217-218 
Decile point, 33 
Decile rank, 33 
Delinquents, 261 
Design Judgment Test, 209 
Diagnosis. See Educational diagnosis 
Diagnostic Analysis, 471 
Diagnostic Examination in Reading 
Abilities, 473 
Dictionary of Occupational Titles, 240 
Dictionary of Psychology, 181 


Diederich, Paul P., 41, 355, 406, 422 
Differences 
interindividual, 459 
intraindividual, 459, 546 
Differential Aptitude Tests (DAT), 
163, 197, 199, 200, 220, 
221, 230, 231, 550, 552 
compared with General Aptitude 
Test Battery, 200-203 
Differential predictions, 546—549 
vocational guidance and, 549 
Distribution curve. See Normal distri- 
bution curve 


Doppelt, Jerome E., 222 

Drake Musical Aptitude Test, 210 

Draw-a-Person Test, 314 

Dressel, Paul L., 63, 325, 329 

Driscoll Play Kit, 314 

Dunning, Gordon M., 346 

Durost, Walter N., 450 

Durrell Analysis of Reading Difficulty, 
475 

Dvorak, Beatrice, 200 

Dyer, Henry S., 7 


Ebel, Robert L., 359, 360, 517 
Education, objectives of, 363 ff. 
Educational diagnosis, 458 ff. 
by group analysis, 478-480 
levels of, 458, 462 
steps in, 463-477 
Educational and Psychological Meas- 
urement, 175 
Educational Testing Service, 99, 135, 
168, 176, 360, 413, 446 
Edwards Personal Preference Schedule 
(EPPS), 309, 310 
Eells, Kenneth, 217 
Ellis, Albert, 301 
Elsbree, Willard S., 514 
Emotions, learning ability and, 477 
Environment 
cultural. See Cultural background 
socioeconomic, 556 
Equivalence 
coefficient of, 86 
measurement of, 90 
Equivalent-forms method, 86 
Error 
amount of, 4 
compensating, 68, 71, 72 
in measurement, 15, 65, 69 ff. 
sources of, 4, 69—70 
standard. See Standard error 
systematic, 69, 71 
Essay writing, evaluation of, 413, 
415-416, 423-424 
Essential High School Content Bat- 
tery, 445, 446 
Evaluation 
definition of, 5 


as educational objective, 366, 391— 
394 
effective teaching and, 461 
of personal-social adjustment, 257 
ff. 
problems in, 9—14 
program of, 487 ff. 
administration of, 499—504 
characteristics of, 489—490 
disadvantages of, 454—455 
functions of, 488—489 
guidance and, 488-489 
instruction and, 488 
objectives of, 491 
planning of, 490-491, 495—499 
scheduling of, 498—499 
supervision of, 488 
teacher and, 487 
techniques used in, 492-493 
relation to instruction, 325—326 
relation to measurement, 6 
of skills, 401 ff. 
teacher's role in, 8-9, 359-360 
Evaluation and Adjustment Series, 57, 
169, 447, 448 
Examinations. See Tests 
Expectancy chart, 469, 546 


F-score, 303 

Factor analysis, 132-137, 147, 195, 
239 

Fifth Mental Measurements Yearbook, 
312 

Findley, Warren G., 452 

Flanagan Aptitude Classification Tests, 
221 

Frederiksen, Norman, 245 

French, John W., 424 

Frequency distribution, 22, 29, 30, 42 

Frequency polygon, 35, 37 

Froehlich, Clifford P., 175 

Fruchter, Benjamin, 136 


Gage, N. L., 333 

Gardner, Eric F., 290 

Gates Diagnostic Reading Tests, 474 
Gates Reading Readiness Tests, 211 


General Aptitude Test Battery (GATB), 
200-204, 205, 221, 553 
compared with Differential Aptitude 
Tests, 201—203 
General Clerical Test, 58, 208 
Gerberich, Raymond, 368 
Gifted students, 12, 138, 435 
Gilmore Oral Reading Test, 475 
Goals. See Objectives 
Goldman, Leo, 175 
Gough, Harrison C., 309 
Grade placement norms, 20, 165, 439 
Grade scores, 43, 50-55 
problems in interpreting, 53-55 
Grading 
centralized control of, 524 
comparability of, 524, 528 
in honors classes, 525, 527, 528 
reliability of, 521—522 
standard policy for, 530-531 
validity of, 521—522 
variations in practices of, 10—11 
Graves, Maitland, 209 
Greene, Edward B., 248 
Gronlund, Norman E., 286, 287 
Guidance 
educational, 13, 119, 197, 543, 545 
evaluation program and, 488—489, 
494, 495 
functions of, 534 
role of teacher in, 535, 536-538 
role of trained personnel in, 536 
use of interest inventories in, 228, 
236, 239 
use of test data in, 541—543 
vocational, 13, 221, 543, 545 
personality inventory in, 301 
study of personality in, 261—262 
use of achievement tests in, 437 
use of differential predictions in, 
546-547 
use of interest inventory in, 246— 
248 
use of profile in, 550 
Guilford, J. P., 241, 297, 298 
Guilford-Schneidman-Zimmerman In- 
terest Survey, 239 


Haganah, Theda, 232, 236 

“Halo effect,” 280, 334, 420, 522 

Handwriting, evaluation of, 411-412 

Hardaway, Mathilde, 176 

Hattwick, L. W., 316 

Hawkes, Herbert E., 325 

Henmon-Nelson Tests of Mental Abil- 
ity, 61 

Histogram, 36 

History, testing of, 108, 109, 110 

Holtzman Ink Blot Technique, 312 

Home economics, evaluation in, 417 

Hopkins, Kenneth D., 443 


Index of forecasting efficiency, 120 
Individualized instruction 
and educational diagnosis, 458 
effective teaching and, 461 
need for, 461—462 
Industrial arts, ratings in, 416-417 
Instruction 
corrective, 480—482 
materials for, 481 
program for, 482 
emphasis on specific knowledge in, 
328-329 
evaluation and, 325—326, 488 
individualized, 458 
need for, 461—462 
improvements in, 321 ff. 
long-range objectives in, 328-329 
Instructional tests, 323-325 
criteria for, 324 
Intelligence quotient 
computation of, 214—216 
constancy of, 216 
origin of, 184 
reasons for variability in, 218—219 
Intelligence tests, 99—100, 119, 130, 
132, 138 
development of, 183-193 
effect of cultural background on 
results from, 217 
group versus individual, 194 
multiscore, 198 
compared with aptitude test bat- 
teries, 198—203 


nonverbal, 192-194 
origin of, 183-184 
purpose of, 212-213 
reasons for varying results in, 218— 
219 
variations in content of, 218 
Interest inventory, 228, 229, 230 
based on empirical study, 234-237 
based on factor analysis, 239 
based on unitary traits, 237-238 
basic interest groups, 241—242 
interpretation of results from, 246— 
248 
occupational, 240 
predictions from 
on academic achievement, 241 
on occupational satisfaction, 243— 
244 
for vocational training, 244-246 
types of, 233-240 
validity of, 241—246 
Interests 
aptitudes and, 230-232 
expressed, 229 
factors affecting development of, 
228-229 
manifest, 229 
measurement of, 228 ff., 497 
methods of obtaining data on, 229- 
230 
stability of vocational, 233 
tested, 229 
Internal-consistency method, 86-87, 
91, 93, 94 
Interval scales, 64 
Interview, with pupil 
importance of rapport in, 266-267 
interpretation of data from, 267- 
268 
on personal-social adjustment, 265- 
268 
preparations for, 266 
Inventory 
of attitudes, 250 
interest. See Interest inventory 
personality, 297 ff. 
Iowa Algebra Aptitude Test, 211 


Iowa Tests of Basic Skills, 440, 492, 
503 

Iowa Tests of Educational Develop- 
ment (ITED), 444, 445, 
513, 550 


Item analysis, 354—357, 478-479 


K-score, 56, 303 
Karnes, M. Ray, 176, 416 
Katz, Martin, 357 
Kaulfers, W. V., 373 
Kawin, Ethel, 508 
Kelley, T. L., 96 
Kent-Rosanoff Free Association Test, 
312 
Klausmeier, Herbert J., 536 
Knowledge 
of abstractions, 376-377 
of categories, 375 
of conventions, 371—373 
of criteria, 375-376 
definition of, 367 
as educational objective, 364 
performance in skills and, 401 
of specifics, 368-371 
of terminology, 368-370 
of trends and sequences in, 373-374 
of universals, 376-377 
Kreidt, P. H., 237 
Kruglak, H., 405 
Kuder, G. Frederic, 230, 234 
Kuder Occupational, Form D, 239 
Kuder Preference Record, 169, 230, 
231, 233, 236, 240, 254, 
550, 552 

Kuder Preference Record, Personal, 
239 


Kuder Preference Record, Vocational, 
238, 247, 310 

Kuder-Richardson method, 87, 88, 90, 
91, 93, 94, 172, 305 

Kuhlmann-Anderson Intelligence Test, 
503 


Languages, foreign 
achievement tests for, 447 
aptitude tests for, 211 


Learning difficulties 

causes of, 475—477 

corrective instruction for, 480—482 

determining, 471—475 

emotional factors and, 477 
Lee, E. A., 240 
Likert, R., 252 
Likert method of attitude-scale 
construction, 252-253 

Lindsay, Alexander D., 388 
Linear standard scores, 35, 43 
Linear transformations, 35, 38 
Loevinger, Jane, 308 
Lorge-Thorndike Intelligence Tests, 
99, 130, 131, 192 


McArthur, C., 236 
McCully, C. Harold, 244 
Machover, Karen, 314 
Maier, Thomas, 176 
Manual dexterity, aptitude in, 205, 
208 
Marston, William, 297 
Matching questions, 342-345, 368 
Mathematics 
aptitude tests for, 211 
grading in, 529-530 
prognostic tests in, 211 
testing of, 446 
Maturity, in personality development, 
258-259 
Maurer, Katherine M., 115 
Mean, 22 
computation of, 19, 24, 25, 27, 623 
Measurement 
of aptitudes. See Aptitudes 
data from. See Data 
definition of, 5 
direct versus indirect, 150-153 
educational diagnosis and, 458—462 
effective teaching and, 461 
principles of, and test selection, 
149 ff. 
problems in, 7-14 
relation to evaluation of, 6 
see also Evaluation, Tests 
Meier Art Tests, 204, 209 


Melville, Donald S., 245 

Mental abilities tests. See Intelligence 
tests 

Mental age, 56, 184, 214 

computation of, 215 

Mental health. See Personal-social ad- 
justment 

Mental Measurement Yearbooks, 174, 
175 

Metropolitan Achievement Test, 62, 
169, 440, 444, 464 

Metropolitan Readiness Tests, 211 

Micheels, William J., 176, 416 

Michigan Vocabulary Profile Test, 
229 

Minnesota Clerical Test, 59, 204, 208 

Minnesota Counseling Inventory 
(MCI), 309 

Minnesota Multiphasic Personality 
Inventory (MMPI), 303, 304, 
309 

Minnesota Paper Formboard, 204, 
205, 206 

Minnesota Rate of Manipulation Test, 
208 

Minnesota Spatial Relations Test, 205, 
206 


Modern Democratic State, The, 388 

Modern Language Aptitude Test, 211 

Mooney Problem Check Lists, 310 

Multiple-choice questions, 339-342, 
351, 368 

Multiple regression equation, 547 

Multitrait-multimethod matrix, 138 
140 

Murphy, Gardiner, 311-312 

Murray, Henry A., 310 


Music 
aptitude in, 209-210 
skill in, 421 


Musical Aptitude Test, 210 


National Guidance Testing Program, 
168 

Neurotic, 260-261 

Nominal numbers, 63 
Normal behavior, 259 
Normal distribution curve, 22, 27 
characteristics of, 34—36 
Normalized standard scores, 34—40, 
43 
relationship between types of, 43, 
617 ff. 
Norming process, of testing, 
definition of, 150 
see also Standardized tests 
Norms, 158, 172 
age, 20, 55-56 
grade placement, 20, 50-55, 165, 
439 
homogeneous versus population in 
general, 57-58 
importance of, 57, 59 
item, 433, 453-454 
local, 19 
modal-age grade, 62 
national, 60, 62, 165, 166 
percentile, 165, 166, 439 
procedures for obtaining, 60 
see also Converted scores 
Number systems, types of, 63 


Objectives 
and instruction, 328—329 
taxonomy of educational, 363 ff. 
Objective tests, 330, 332 
Objectivity 
in evaluation, 460 
of teacher-made tests, 331-332 
in scoring, 167—168, 422, 429, 430, 
522 
Observation 
of attitudes, 248—251 
of behavior, 270—276 
informal, 270-272 
recording data from, 272-274 
by situational tests, 274—276 
systematic, 270 
O'Connor Finger and Tweezer Dex- 
terity Tests, 208 
Occupational Interest Inventory, 240, 
241 


Oden, M. H., 245 

Ogive, 28 

Ordinal scales, 63 

Orleans Algebra Prognosis Test, 211 

Otis Normal Percentile Chart, 39, 40, 
41 

Otis Quick-Scoring Mental Ability 
Tests, 219, 469 


Paranoid, 304 
Parents 
school records and, 508 
teacher conference with, 516-519, 
520 
Peabody Library Information Test, 
492 
Pearson product-moment method, 77— 
78, 232, 624-626 
Peer-nomination technique, 139, 276, 
289-291 
Percentile point, 33 
Percentile rank, 18, 33, 39 
Percentile scores, 28 
advantages of, 30 
computation of, 32 
disadvantages of, 30, 32 
Performance tests 
disadvantages of, 404—405 
scoring of 
by process, 405—406, 408, 411 
by product, 405-406, 409, 411- 
412 
by ranking, 406-407, 422 
selection of, 403-404 
standardization of conditions for, 
404 
types of, 402 
see also Achievement tests, Skills 
Personal and Social Development Pro- 
gram, 271 
Personal-social adjustment 
criteria of, 261 
definition of, 259 
evaluation of 
difficulties in, 257 



by informal observation, 270— 
274 

by opinions of others, 264—265, 
276 ff. 

by projective tests, 264, 295-296, 
311, 315 

by self-report, 263, 265-269 

by situational tests, 274—276 

by sociometric techniques, 281— 
289 

by systematic observation, 270 


internal conflicts and, 260 

major concepts of, 258-259 

peer judgment of, 289-291 
teacher rating scales for, 276-280 

Personality 
clinical study of, 262, 311-316 
description of, 261—263 
psychometric study of, 261—262 
see also Personal-social adjustment 

Personality development 
maturity in, 258-259 
normality in, 259 
study of, 257 ff. 

Personality inventory, 297 ff. 
disadvantages of, 299-301 
interpretation of results from, 306- 
311 
problems checklist as, 310-311 
reliability of, 305-306 
validity of, 301-305 

Personality traits 
homogeneous, 297 
source, 298 
surface, 297 

Personnel and Guidance Journal, 175 

Personnel Psychology, 175 

Physical Science Study Committee, 
447 
Pintner intelligence tests, 192, 469 
Pintner-Paterson Performance Scale, 
184 

Prediction equations, 547—549 

Prescott, George A., 450 

Primary Mental Abilities Test, 91 

Problem-solving questions, 345-347 


Problems checklist, 310—311 
Product scale, 411—412, 413, 422 
Profile, 229, 238, 550 
analysis of, 556-559 
aptitude test, 196, 553 
preparation of, 555 
Project Talent, 61, 62 
Psychologist, evaluation program and, 
494 

Psychological tests. See Tests, psycho- 
logical 

Psychometric approach, 261—262, 
297 ff. 


vs. clinical approach, 295-296 
Psychotic, 260 
Purdue Pegboard, 208 


Rank-difference method, 75-76 
Ranking 
for grading, 406-407, 422 
teacher-made tests for, 322-323 
Rapport, 266-267 
Rating scale, 406-411, 422 
advantages of, 514 
design of, 278-280 
graphic, 278 
teacher, 276-280 
Ratio scales, 64 
Raw scores, 17 
converted to age and grade scores, 
43 ff. 
converted to normalized standard 
scores, 36, 37 
converted to percentile scores, 28 
Readiness 
for arithmetic, 211 
for reading, 210, 211 
Reading 
diagnostic tests in, 473-475 
remedial, 474 
Reading Comprehension Grade Place- 
ment (RCGP), 467 
Reading-readiness tests, 119, 153, 210, 
211, 495 
Redl, Fritz, 260 
Regression, 82 


Regression effect, 450 
Relevance, of tests, 103-106, 159 
Reliability, 68 ff., 158, 159, 163, 165, 
172 
coefficient of 
comparison of standard error 
with, 83 
definition of, 73 
factors affecting size of, 93-95 
methods of obtaining, 84-92 
standards for, 95-96 
correlation and, 74 
definition of, 68 
of difference scores, 92, 163 
in evaluating skills, 421-424 
homogeneity of sample and, 90, 94 
methods of estimating, 93 
methods of increasing, 97-99 
minimum coefficient of, 96 
of personality inventories, 305-306 
of teacher-made tests, 322-323 
validity and, 103 
Remmers, H. H., 333 
Report cards, 512-515 
Review of Educational Research, 175 
Rice, J. M., 429 
Rimland, Bernard, 300 
Rorschach Ink Blot Test, 312, 313 
Rosenzweig Picture-Frustration Study, 
313 
Rothney, John W. M., 175, 541 
Rotter Incomplete Sentence Test, 313 


Sample 
homogeneity of, 90 
random, 104 
validity and, 107 
representativeness of, 5, 6, 15, 60 
validity and, 108-109 
size of, 6, 97, 104, 422 
in tests, 4 
Sapon, W., 211 
Sarbin, T. R., 243 
Sax, Gilbert, 443, 519 
Scannell, Dale P., 525, 529 
Scatter diagram, 79, 80 


Scholastic Aptitude Test (SAT), 170, 
171, 172, 217 

School and College Ability Tests 
(SCAT), 168, 171, 438, 
439 

Schools, administrative problems of, 
9-14 

Science, in achievement tests, 438, 
446 

Science Research Associates (SRA) 
Achievement Series, 54, 134, 
141, 473 

Science Research Associates (SRA) 
Junior Inventory, 311 

Science Research Associates (SRA) 
Youth Inventory, 310 

Scores 
ability, 553 
composite, 522-525 
difference, 92 
expectancy, 62 
perfect, 18 
true, 71, 72 
see also Converted scores, Age 
scores, Grade scores, Normalized 
standard scores, Percentile scores, 
Raw scores, Standard scores, 
t-scores, Test scores, z-scores 
Scoreze, 168, 439, 471 
Scoring 
ease of, 167-168 
of essay tests, 330, 334 
in evaluative process, 7 
interpretation of, 169 
of objective tests, 330, 332 
objectivity in, 97-98, 167-168, 422, 
429, 430, 522 
of performance tests, 405-412 
of standardized tests, 503—504 
Seashore Measures of Musical Talents, 
204, 209 
Segel, David, 232 
Self-report techniques, 263, 265-269 
Sequential Tests of Educational Progress 
(STEP), 99, 162, 163, 
168, 169, 413, 438, 439, 
440, 444, 445 
Shorthand, aptitude tests for, 211—212 
Simon, Theodore, 184 
Skewed distribution, 27, 37 
Skills 
athletic, 417, 419—420 
basic, 476, 496 
communication, 412—416 
computational, 473 
evaluation of, 401 ff. 
knowledge and, 401 
manipulative, 416-417 
methods for testing, 403-404 
reliability in evaluating, 421—424 
validity in evaluating, 420—421 
work-study, 476 
Snellen Chart, 477 
Social studies 
in achievement test, 438—439, 447 
objective in learning, 328-329 
testing of, 373-374 
Sociogram, 283-286 
see also Sociometric techniques 
Sociometric techniques, 281—292 
administration of, 282-283 
interpretation of, 284—286 
peer judgments and, 289—291 
purpose of, 291 
problems in using, 291-292 
reliability of, 286-287 
role of teacher and, 282-283 
selection of questions in, 281—282 
validity of, 286-289 
Spearman rank-difference method, 75— 
76 
Spearman-Brown formula, 87, 622 
Spelling 
testing of, 98, 107, 152 
use of teaching machines for, 11, 
14 
Spencer, Douglas, 297 
Spitzer Study Skills Test, 492 
Split-halves method, 86, 93, 305 
Stability, measurement of, 90 


Standard deviation, 20, 22 
basic formula for, 71 
computation of, 24, 71 
use of, 21 
variance and, 71 

Standard error, 73 
reliability coefficients compared 
with, 83 

Standard error of estimate 
correlation and, 82 
predictive validity coefficient and, 
120 

Standard scores, 20 ff., 617-621 
see also t-scores, z-scores 

Standardized tests 
administration of, 500-502 
characteristics of, 149-150 
criticisms of, 428, 430 
evaluation of, 157-166, 170-174 
preparation for, 500 
scheduling of, 501-502 
scoring of, 503-504 
selection of, 493-494 
uses of, 432-437, 460 

Stanford Achievement Test, 51, 54, 
56, 61, 62, 429, 443, 444 

Stanford-Binet Scales, 45, 98, 131, 
184, 185, 186-192 
compared with Wechsler Intelligence 
Tests, 186-190 
Stanine scores, 41, 42, 464, 517, 523 
advantages of, 42, 43 
computation of, 41 
Stern, William, 184 
Stevens, Lucia B., 236 
Stone Reasoning Test in Arithmetic, 
429 

Strong, E. K., 233, 234, 243, 244 

Strong Vocational Interest Blank 
(SVIB), 235-237, 239, 241, 
244, 245, 247, 304 

Super, Donald E., 175, 200, 208, 241, 
254, 539 

Sweden, grading system in, 524 

Survey of Study Habits and Attitudes, 
492 

Synthesis, as educational objective, 
366, 388-391 
Szondi Test, 314 

T-scores 
comparison of z-scores with, 27 
computation of, 27 
T-scaled scores, 38, 39, 43 
Taxonomy, of educational objectives, 
363 ff. 
Taylor, K., 233 
Teacher-made tests, 321 ff., 492 
analysis of results from, 354—357 
content validity of, 325-328 
content versus goal emphasis in, 
328-330 
directions for, 350-353 
editing of, 349-350 
evaluation of, 347-349 
grouping of items in, 350 
improvement of, 357-360 
for instructional purposes, 323-325 
make-up of, 353 
preparation for, 349-354 
for ranking students, 322-323 
requirements for, 322 
Teachers 
educational diagnosis and, 459 
evaluation program and, 321 ff., 487 
interview with pupil and, 265-268 
observation of behavior by, 265- 
267, 270-276 
parent conference with, 516—519, 
520 
rating scales of, 276-280 
role in guidance, 535, 536—538 
role during testing, 502-503 
study of personal-social adjustment 
by, 257, 264 
use of achievement tests by, 432- 
433 
use of cumulative records by, 510, 
511-512 
use of projective techniques by, 315 
use of sociometric techniques by, 
282-283 
use of standardized tests by, 460 


Teaching machines, 11, 14 
"Technical Recommendations for 
Achievement Tests," 55 
Technical Recommendations for Psy- 
chological Tests and Diag- 
nostic Techniques, 106, 119, 
147 
Terman, Lewis M., 184, 245 
Terman-McNemar Group Test of 
Mental Ability, 216, 469 
Test items 
completion, 334—336 
essay, 330, 331, 333, 334, 388 
grouping of, 350 
matching, 342-345, 368 
multiple-choice, 339-342, 351, 368 
problem solving, 345-347 
true-false, 337-339, 351 
Test-retest method, 85-86, 305 
Test scores 
effect of anxiety on, 172 
effect of coaching on, 171 
effect of cultural background on, 
172 
effect of fatigue on, 171 
effect of practice on, 131 
reliability of, 97-100 
sources of variance in, 69—70 
Testing 
analysis and, 110 
application and, 110 
comprehension and, 109 
evaluation and, 110 
knowledge and, 109 
Testing program. See Evaluation, pro- 
gram of 
Tests 
abilities, 154 
achievement. See Achievement tests 
administration of, 98, 150, 158, 
166-170 
aptitude. See Aptitude tests 
classification of 
based on content, 154 
based on procedure, 153-154
by degree of indirectness, 150-153
cost of administering, 169-170 
diagnostic, 471-472
essay, 330, 331, 333-334, 388 
group, 153 
"identical-element," 152 
instructional, 323-325 
intelligence. See Intelligence tests 
length of, 97 
as measurement of criterion behavior, 103-105
moral knowledge, 138 
objective. See Objective tests 
pencil-and-paper, 153 
performance. See Performance tests 
power, 153 
prognostic, 210-212, 234, 496
projective, 264, 295-296, 311, 315 
psychological, 5 
purposes of, 103, 106-107 
reading-readiness, 119, 153, 210, 211
"related behavior," 152, 402
scores of. See Test scores
selection of, 11-12, 14, 99
self-descriptive, 156
situational, 274-276
space and number, 91 
speed, 153 
standardized. See Standardized tests 
teacher-made. See Teacher-made tests
"verbalized behavior," 153
verbal reasoning, 91-92
vocabulary, 98 
"work-sample," 152, 402 
Tests of Primary Mental Abilities, 
198, 199 
Thematic Apperception Test (TAT), 
313 
Thompson, Anton, 503 
Thompson, George C., 290 
Thorndike, E. L., 138, 411, 429 
Thorndike Handwriting Scale, 429 
Thorpe, Louis P., 210, 240 
Thurstone, L. L., 194, 198, 251 
Thurstone method of attitude-scale construction, 251-253
Tomkins-Horn Picture Arrangements Test (PAT), 314
Torgerson, T. L., 419
Traits
definition of, 128, 129
personal, 277
social, 277
validity of, 137
see also Personality traits
Traxler, A. E., 423 
Treacy, John P., 473 
True-false questions, 337-339, 351 
Tryon, Caroline, 289 
Tyler-Kimber Study Skills Test, 492 


United States Bureau of the Census, 61
United States Employment Service, 
221, 240 


Validity, 103 ff., 159
coefficient of, 120-123, 162
concurrent, 107, 114-119, 130, 146, 157, 160, 162, 171, 302, 304
construct, 107, 128-143, 161, 162, 302
content, 107-114, 130, 146, 157, 160, 171, 325
convergent, 138, 145
definition of, 103, 145
discriminant, 138, 145
in evaluating skills, 420-421
face, 160
factorial, 134, 147
of interest inventories, 241
of personality inventories, 301-305
predictive, 107, 119-128, 130, 146, 157, 160, 162, 171, 304
relevance and, 103-106
reliability and, 103

Variance 
computation of, 71 
correlation and, 82 
error, 72, 132 
lasting general, 69-70, 84-85 
lasting specific, 70, 84-85 
sources of, 69-74
standard deviation and, 71 
temporary general, 70, 84-85 
temporary specific, 70, 84-85 
total, 72 
true, 72, 73 
Verbal-educational factor, in testing, 
156, 157 
Vernon, P. E., 241 
Veterans’ Administration, vocational 
guidance of, 545 
Vocabulary, testing of, 98 
Vocation 
measurement of interests in, 236-237
stability of interests in, 233 
see also Guidance 


Wagner, Eva Bond, 175
Wattenberg, William W., 260
Wechsler, David, 185
Wechsler Adult Intelligence Scale (WAIS), 186
compared with Stanford-Binet Scales, 186-190
Wechsler Intelligence Scale for Children (WISC), 186
compared with Stanford-Binet Scales, 186-190
Wechsler-Bellevue Scale for Adolescents and Adults, 185, 186
Wesman, Alexander G., 231
West, J. Y., 249
Wetzel “grid,” 477
Whistler, Harry S., 210
Whitney, A. P., 287
Wiener, Daniel N., 302
Wing Standardized Tests of Musical Intelligence, 210
Wrenn Study Habits Inventory, 492
Wrinkle, William, 491, 514, 520


Z-scores
comparison of T-scores with, 27
computation of, 23, 26, 38
Pearson product-moment method and, 77-78


